Abstract

This  thesis  explores  various  theoretical  aspects  of  machine  learning.  Particular  emphasis  is 
placed  on  techniques  for  designing  and  analyzing  computationally  efficient  learning  algorithms. 

Many  of  the  results  in  this  thesis  are  concerned  with  the  so-called  distribution-free  or
probably  approximately  correct  (PAC)  model  of  learning  proposed  by  Valiant.  In  this  model,  the
learner  tries  to  identify  an  unknown  concept  based  on  randomly  chosen  examples  of  the  concept. 
Examples  are  chosen  according  to  an  unchanging  but  unknown  and  arbitrary  distribution  on 
the  space  of  instances.  The  learner’s  task  is  to  find  a  hypothesis  or  prediction  rule  that,  with 
high  probability,  correctly  classifies  all  but  an  arbitrarily  small  fraction  of  the  instances. 

Following  a  brief  introduction,  this  thesis  begins  in  Chapter  2  with  a  study  of  the  problem 
of  improving  the  accuracy  of  a  hypothesis  output  by  a  learning  algorithm  in  this  model.  In 
particular,  it  is  shown  that  any  “weak”  learning  algorithm  that  performs  just  slightly  better
than  random  guessing  can  be  converted  into  one  whose  error  can  be  made  arbitrarily  small. 
Among  the  many  consequences  of  this  result  is  a  technique  for  converting  any  PAC-learning 
algorithm  into  one  that  is  highly  space  efficient. 

In  Chapter  3,  we  next  explore  in  detail  a  simple  but  seemingly  powerful  technique  for 
discovering  the  structure  of  an  unknown  read-once  formula  from  random  examples.  The  method 
is  based  on  sampling  of  the  target  formula’s  statistical  behavior  under  various  perturbations 
of  the  underlying  instance-space  distribution.  An  especially  nice  feature  of  this  technique  is 
its  powerful  resistance  to  noise.  One  of  the  highlights  of  this  chapter  is  the  application  of  this 
technique  to  derive  the  first  polynomial-time  algorithm  for  learning  read-once  Boolean  formulas 
over  the  usual  basis  against  any  product  distribution  (i.e.,  any  distribution  in  which  the  setting 
of  each  variable  is  chosen  independently  of  the  settings  of  the  other  variables).  Algorithms  for 
various  other  classes  of  read-once  formulas  are  also  presented. 

We  next  consider  in  Chapter  4  a  realistic  extension  of  the  PAC  model  to  concepts  that 
may  exhibit  uncertain  or  probabilistic  behavior.  Such  probabilistic  concepts  arise  naturally  in 
many  situations,  such  as  weather  prediction,  where  the  measured  variables  and  their  accuracy 
are  insufficient  to  determine  the  outcome  with  certainty.  While  building  on  the  recent  results 
of  Haussler  on  the  sample  complexity  of  learning  in  probabilistic  settings,  this  chapter  focuses 
primarily  on  the  design  of  efficient  algorithms  for  learning  probabilistic  concepts.  This  work 
also  extends  many  of  the  results  in  the  standard  PAC  model  to  the  new  probabilistic  model. 

In  the  last  chapter,  we  present  new  algorithms  for  inferring  an  unknown  finite-state
automaton  from  its  input-output  behavior.  This  problem  is  motivated  by  the  problem  faced  by  a  robot
in  unfamiliar  surroundings  that  must,  through  experimentation,  discover  the  “structure”  of  its
environment.  Some  of  our  algorithms  are  based  on  Angluin’s  algorithm  for  learning  finite-state 
automata;  however,  unlike  her  procedure,  our  algorithms  are  effective  in  the  absence  of  a  means 
of  resetting  the  machine  to  a  start  state.  We  describe  provably  effective  learning  algorithms 
based  on  both  the  usual  state-based  representation,  and  the  diversity-based  representation
introduced  by  Rivest  and  Schapire.  We  also  present  superior  algorithms  for  the  special  class  of
permutation  automata. 

Portions  of  this  thesis  are  joint  work  with  Sally  A.  Goldman,  Michael  J.  Kearns  and 
Ronald  L.  Rivest. 

Thesis  Supervisor:  Ronald  L.  Rivest 
Title:  Professor  of  Computer  Science 
Chapter  1


Introduction 


The  final  objective  of  machine  learning  research  is  to  build  machines  that  learn  from  experience. 
For  example,  we  might  like  to  build  an  autonomous  robot  that  can  adapt  to  its  environment, 
and  can  teach  itself  to  walk,  navigate,  grasp  objects,  etc.  It  might  also  be  useful  to  have  a 
general  purpose  prediction  machine  that  can  study  large  data  bases,  recognize  patterns,  and 
based  on  those  patterns  make  predictions  about  the  future.  Such  a  machine  might  be  useful,  for 
instance,  as  a  tool  for  predicting  earthquakes,  rendering  medical  diagnoses,  or  otherwise  aiding 
in  scientific  discovery.  There  is  little  doubt  that  machine  learning  will  also  play  an  important
role  in  the  development  of  computers  that  can  speak,  read  and  understand  human  languages. 

The  solution  of  such  complicated  tasks  is  clearly  beyond  the  skill  of  even  the  best  programmer;
what  is  needed  is  the  development  of  systems  that  learn,  computers  that  can  program
themselves.  Even  if  we  could  directly  program  a  machine  to  solve  such  tasks,  the  result  would
be  a  highly  inflexible  machine  capable  only  of  carrying  out  the  specific  tasks  for  which  it  was
preprogrammed.  On  the  other  hand,  a  system  incorporating  machine  learning  technology  would,
in  principle,  be  much  more  versatile,  capable  of  adapting  to  changing  conditions.  Flexibility  is 
vital  to  several  of  the  problems  mentioned  above;  for  instance,  a  mobile  robot  should  be  able 
to  adapt  to  changing  terrain,  and  a  computer  that  interprets  spoken  English  should  be  able  to 
adjust  to  a  different  speaker. 

Besides  these  practical  considerations,  the  study  of  machine  learning  may  also  lead  to  a 
deeper  understanding  of  human  intelligence  and  of  the  remarkable  phenomena  of  learning, 
inference,  induction  and  adaptation. 

Though  the  objective  of  machine  learning  research  seems  clear,  the  way  to  reach  that
objective  is  not  so  obvious.  This  tantalizing  problem  has  attracted  researchers  from  a  myriad  of
different  fields,  including  psychology,  cognitive  science,  neural  science,  linguistics,  philosophy,
physics,  computer  science,  control  theory  and  mathematics.  The  approaches  applied  to  the
problem  in  recent  years  are  also  tremendously  varied.  For  instance,  there  is  today  a  great  deal 
of  interest  in  so-called  connectionist  learning  algorithms  [2,  76],  some  of  which  are  loosely
inspired  by  the  anatomy  of  the  brain.  Others,  such  as  Holland  [45,  46],  work  on  so-called  genetic
algorithms  which  learn  by  mimicking  evolutionary  processes.  Gyorgi  and  Tishby  [32]  have
recently  used  principles  of  statistical  mechanics  to  understand  certain  learning  algorithms,  and
Drescher  [21]  has  designed  and  implemented  a  learning  system  based  on  Piaget’s  theories  of 
early  childhood  development.  These  examples  illustrate,  but  by  no  means  exhaust,  the  diversity 
and  abundance  of  approaches  applied  to  the  problem  of  machine  learning.  For  a  more
comprehensive  treatment,  see  for  instance  the  survey  papers  of  Dietterich  [20]  and  Mitchell  et  al.  [64];
there  are  also  several  collections  of  learning  papers  available  [54,  62,  63,  65,  80]. 

The  results  in  this  thesis  belong  to  an  area  of  machine  learning  research  known  as
computational  learning  theory.  Here,  our  purpose  is  to  investigate  the  problem  of  machine  learning  in
a  mathematically-oriented  framework.  Our  goal  is  to  develop  a  sound,  theoretical  foundation 
for  studying  and  understanding  machine  learning. 

Computational  learning  theory  is  also  characterized  by  its  emphasis  on  efficiency.  Thus, 
we  are  interested  in  building  machines  that  not  only  learn,  but  that  also  learn  efficiently, 
consuming  limited  resources  as  sparingly  as  possible.  Specifically,  we  are  most  often  interested 
in  minimizing  the  time  it  will  take  for  a  machine  to  learn,  the  size  of  the  computer  that  will  be  needed
(in  terms  of  memory  size),  and  the  amount  of  data  that  must  be  collected  for  learning  to  take 
place.  Efficiency  is  an  extremely  important  issue.  For  the  purposes  of  developing  computer 
programs  that  run  fast,  the  design  of  efficient  algorithms  is  arguably  of  greater  importance 
than  the  development  of  faster  and  more  powerful  computer  processors. 

This  thesis  is  about  the  design  of  efficient  learning  algorithms,  and  the  analysis  of  their 
performance.  The  main  part  of  this  thesis  is  organized  into  four  fully  self-contained  chapters 
(i.e.,  no  chapter  is  prerequisite  to  any  other).  Each  chapter  considers  a  different  aspect  of 
machine  learning.  Here  are  some  of  the  main  contributions: 

Chapter  2  describes  a  general  technique  for  dramatically  improving  the  error  rate  achieved 
by  a  certain  kind  of  concept-learning  algorithm.  Specifically,  it  is  shown  that  a  “weak”  learning 
algorithm  whose  error  rate  is  just  slightly  better  than  that  achieved  by  random  guessing  can 
be  converted  into  one  whose  error  rate  is  extremely  small.  A  surprising  consequence  of  this 
result  is  a  technique  for  dramatically  improving  the  space  efficiency  of  many  known  learning 
algorithms. 

Chapter  3  explores  in  detail  a  simple  but  powerful  statistical  method  for  efficiently  inferring 
the  structure  of  certain  kinds  of  Boolean  formulas  from  random  examples  of  the  formula’s 
input-output  behavior.  A  featured  application  of  this  method  is  an  algorithm  that  efficiently 
infers  a  good  approximation  of  any  read-once  formula  (in  which  each  variable  occurs  at  most
once)  when  the  random  examples  are  chosen  according  to  a  product  distribution  (in  which  the 
setting  of  each  variable  is  independent  of  the  settings  of  the  other  variables). 

Chapter  4  extends  a  standard  model  of  concept  learning  to  accommodate  concepts  that 
sometimes  exhibit  uncertain  or  probabilistic  behavior.  This  chapter  systematically  explores  a 
variety  of  tools  and  techniques  for  designing  efficient  learning  algorithms  in  such  a  probabilistic 
setting.  For  example,  we  describe  an  algorithm  for  learning  a  probabilistic  analog  of  decision 
lists. 

Finally,  Chapter  5  presents  a  set  of  efficient  algorithms  to  be  used  by  a  robot  to  infer  the 
“structure”  of  its  environment  through  experimentation.  In  particular,  we  describe  algorithms 
that  efficiently  infer  the  structure  of  any  deterministic  and  finite-state  environment  by  planning 
and  executing  experiments,  and  with  the  additional  aid  of  a  source  of  “counterexamples”  to 
incorrectly  conjectured  models  of  the  environment. 

A  more  detailed  overview  of  this  thesis  follows  below. 

Concept  learning 

Many  of  the  results  in  this  thesis  deal  with  one  of  the  most  fundamental  of  learning  problems, 
namely,  learning  a  concept  from  examples.  In  recent  years,  a  great  deal  of  research  has  been 
devoted  to  this  learning  problem,  much  of  it  within  the  framework  of  one  particular  learning 
model  introduced  by  Valiant  [83]  in  1984.  Since  it  is  so  important  both  to  the  results  in 
this  thesis  and  to  current  research  in  computational  learning  theory,  we  begin  with  a  brief 
introduction  to  the  Valiant  model. 

Informally,  a  concept  is  a  rule  that  divides  the  world  into  positive  and  negative  examples. 
For  instance,  the  concept  of  “being  red”  divides  the  world  into  those  things  that  are  red  and 
those  that  are  not  red.  The  learning  algorithm  is  presented  with  examples  of  the  concept,  and 
is  told  if  each  is  a  positive  or  negative  example  of  the  concept.  For  instance,  the  learner  might 
be  shown  several  objects  and  told  whether  each  is  red  or  not. 

To  determine  if  the  learner  has  succeeded  in  “learning”  the  concept,  we  give  it  a  test:  we
present  it  with  one  or  more  unclassified  examples,  and  ask  it  to  predict  if  each  is  a  positive  or 
negative  example  of  the  concept;  if  the  learner  “understands”  the  concept,  it  should  have  no 
trouble  passing  this  test. 

The  “universe  of  objects”  from  which  the  learner  is  presented  examples  is  called  the  domain 
(or  sometimes,  the  instance  space),  and  each  object  in  the  domain  is  called  an  instance.  For
instance,  in  the  example  above,  the  domain  might  consist  of  all  the  fruit  in  the  world,  in  which
case  all  of  the  examples  observed  by  the  learner  are  pieces  of  fruit,  and  the  learner’s  job  is  to 
distinguish  red  fruit  from  non-red  fruit. 

Figure  1:  Learning  rectangles  in  the  plane.

As  another  example,  the  learner  might  be  trying  to  learn  the  “concept”  of  the  solid-border 
rectangle  c  in  Figure  1.  In  this  example,  the  domain  is  the  set  of  all  points  in  the  real  plane.  The 
learner  is  presented  with  the  sample  points  labeled  +  or  -  as  shown  in  the  figure.  When  tested, 
the  learner  may,  for  example,  choose  to  classify  all  points  inside  the  dashed-border  rectangle  h 
as  positive,  and  those  outside  as  negative.  (This  would  be  a  reasonable  prediction  rule  since  it 
is  consistent  with  the  observed  sample.)  Such  a  prediction  rule  is  called  a  hypothesis.  In  this 
case,  the  learner  will  misclassify  a  test  point  if  and  only  if  it  falls  in  the  shaded  region,  the 
so-called  “symmetric  difference”  between  the  two  rectangles  c  and  h. 
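
The rectangle example can be made concrete with a short sketch. The code below is only an illustration under stated assumptions: the learner shown here picks the smallest axis-parallel rectangle containing the positive points (one of many hypotheses consistent with a sample, and not necessarily the h drawn in Figure 1), the target rectangle and the uniform distribution on a 10 by 10 square are hypothetical, and the function names are ours.

```python
import random

def in_rect(rect, p):
    """rect = (x_lo, x_hi, y_lo, y_hi); p = (x, y)."""
    x_lo, x_hi, y_lo, y_hi = rect
    return x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi

def fit_rectangle(sample):
    """Smallest axis-parallel rectangle containing every positive point.

    sample is a list of ((x, y), label) pairs. Returns None (meaning
    "classify everything as negative") if no positive point was observed.
    """
    pos = [p for p, label in sample if label]
    if not pos:
        return None
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(h, p):
    return h is not None and in_rect(h, p)

def estimate_error(c, h, draw, trials=20000):
    """Monte Carlo estimate of the probability that a random point falls
    in the symmetric difference of c and h, i.e. is misclassified by h."""
    wrong = sum(in_rect(c, p) != predict(h, p) for p in (draw() for _ in range(trials)))
    return wrong / trials

rng = random.Random(0)

def draw():
    # Assumed distribution: uniform on the square [0, 10] x [0, 10].
    return (rng.uniform(0, 10), rng.uniform(0, 10))

c = (2.0, 7.0, 3.0, 8.0)  # hypothetical target rectangle
sample = [(p, in_rect(c, p)) for p in (draw() for _ in range(200))]
h = fit_rectangle(sample)
print("hypothesis:", h)
print("estimated error:", estimate_error(c, h, draw))
```

Because this particular hypothesis always lies inside the target, its symmetric difference with c is just the part of c that h fails to cover.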

Note  that  in  this  example,  we  assume  that  the  learner  knows  a  priori  that  the  target  concept 
c  (the  one  it  is  trying  to  learn)  is  a  rectangle.  That  is,  as  is  typical  for  such  learning  problems, 
the  learner  knows  beforehand  that  c  belongs  to  some  concept  class,  namely,  the  class  of  all
rectangles. 

What  we  have  not  yet  specified  is  how  the  examples  presented  to  the  learner  for  training 
and  for  testing  should  be  chosen.  Typically,  we  assume  that  the  examples  are  chosen  at  random 
(although  other  scenarios  are  certainly  possible).  This  random  selection  of  examples  is  meant 
to  model  the  “random”  observations  that  might  be  made  in  the  real  world.  However,  the 
distribution  of  such  observations  may  be  quite  arbitrary.  We  therefore  will  often  ask  that 
our  learning  algorithm  be  effective  when  examples  are  presented  randomly  according  to  any 
distribution  on  the  domain. 

Thus,  returning  to  the  previous  example,  we  ask  that  the  learner  be  able  to  learn  the
difference  between  red  and  non-red  fruit,  regardless  of  the  “true”  distribution  of  fruit  in  the  world.
This  may  seem  unfair  since  the  entirely  arbitrary  distribution  may  be  such  that  some  kinds  of 
fruit  are  never  observed.  For  instance,  the  learner  might  only  be  shown  green  fruit,  making 
it  impossible  to  truly  learn  the  concept  of  red  fruit.  However,  we  only  ask  that  the  learner 
perform  well  when  tested  on  instances  chosen  randomly  according  to  the  same  distribution  on 
which  it  was  trained.  Thus,  if  the  learner  never  saw  a  red  piece  of  fruit  in  training,  then  it  is 
unlikely  that  it  will  see  one  in  testing,  and  so  it  should  be  able  to  pass  such  a  test. 

We  measure  the  quality  of  a  learning  algorithm  by  its  expected  performance  on  such  a  test. 
More  precisely,  the  error  of  a  learning  algorithm  (or  rather,  of  its  hypothesis)  is  the  probability 
that  it  will  misclassify  a  new  instance  when,  as  described  above,  the  new  instance  is  chosen 
randomly  according  to  the  target  distribution,  the  distribution  on  which  the  learner  was  trained. 
Equivalently,  the  error  is  the  expected  fraction  of  instances  that  will  be  misclassified  in  any  test. 
Similarly,  the  accuracy  is  the  chance  that  the  learner  correctly  classifies  a  new  instance. 

We  ask  that  the  learner  be  able  to  make  its  error  arbitrarily  small.  That  is,  for  any  arbitrarily
small  positive  number  ε,  the  learner  should  be  able  to  find  a  hypothesis  with  error  less  than  ε.
Also,  there  is  always  some  small  chance  that  the  algorithm  receives  an  unfairly  unrepresentative
sample  that  causes  it  to  fail  to  learn  the  concept  with  the  desired  accuracy.  However,  we  ask
that  the  probability  of  such  a  failure  be  less  than  δ,  where  δ,  like  ε,  is  any  arbitrarily  small
positive  number.

Naturally,  the  smaller  the  chosen  values  of  ε  and  δ,  the  larger  the  sample  needed  by  the
learning  algorithm,  and  the  longer  the  computation  time  needed.  However,  we  ask  that  the
sample  size  and  computation  time  not  grow  too  quickly  as  ε  and  δ  are  made  small,  nor  as
other  parameters  of  the  learning  problem  increase.  In  other  words,  we  ask  that  the  learning
algorithm  be  efficient.  To  make  this  notion  of  efficiency  more  precise,  we  require  that  the
learning  algorithm’s  running  time  be  bounded  by  a  polynomial  in  1/ε,  1/δ  and  any  other
parameters  which  measure  the  size  of  the  learning  problem.  (For  example,  if  learning
“hyperrectangles”  in  R^n,  we  might  allow  the  running  time  to  also  be  polynomial  in  n.)

As  mentioned  above,  the  formal  model  we  have  just  described  was  introduced  by  Valiant  [83]. 
Since  the  target  distribution  may  be  arbitrary,  this  model  is  sometimes  called  the
distribution-free  model.  The  model  is  also  referred  to  as  the  probably  approximately  correct  (PAC)  model
since  the  learning  algorithm’s  hypotheses  should  be  approximately  correct  (have  low  error)  with 
high  probability. 
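
In symbols, the informal requirements above can be summarized as follows. The notation is introduced here only for illustration: D denotes the target distribution, c the target concept, h the learner's hypothesis, ε the accuracy parameter and δ the confidence parameter; the outer probability is taken over the random sample (and any randomness used by the learner).

```latex
\[
  \mathrm{err}_D(h) \;=\; \Pr_{x \sim D}\bigl[\, h(x) \neq c(x) \,\bigr],
  \qquad
  \Pr\bigl[\, \mathrm{err}_D(h) \ge \epsilon \,\bigr] \;<\; \delta .
\]
```

The first expression is the error of h; the second says that the event "h is not approximately correct" occurs with probability less than δ, for every choice of D, of c in the concept class, and of ε, δ > 0.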

Actually,  the  names  “PAC”  and  “distribution-free”  refer  to  different  aspects  of  the  learning
model,  and  it  makes  sense  to  talk  about  a  PAC-learning  algorithm  (one  that  achieves  low
error  with  high  probability)  that  is  effective  only  against  certain  restricted  target
distributions.  (However,  unless  explicitly  stated  otherwise,  PAC  learning  refers  to  learning  in  Valiant’s
distribution-free  model.) 

Improving  a  mediocre  learning  algorithm 

Besides  restricting  the  target  distribution,  there  are  many  other  ways  in  which  the  Valiant 
model  might  be  modified;  in  fact,  most  of  the  results  in  this  thesis  deal  with  variations  on  the 
Valiant  model.  For  instance,  the  Valiant  model  seems  very  demanding  in  its  insistence  that  the 
learner  be  able  to  make  the  error  of  its  hypotheses  arbitrarily  small.  Why  not  just  require  that 
the  learner  be  90%  correct,  or  65%  correct?  (After  all,  65%  is  a  passing  grade  in  most  American 
grade  schools.)  Note  that  a  learning  algorithm  that  guesses  entirely  at  random  on  every  test 
point  achieves  an  accuracy  rate  of  50%.  What  if  we  then  require  the  learning  algorithm  to  be 
only  51%  correct?  Certainly,  we  would  expect  it  to  be  much  easier  to  design  such  a  “weak” 
learning  algorithm  that  performs  just  barely  better  than  random  guessing  than  it  would  be  to 
find  one  that  achieves  an  accuracy  rate  of,  say,  99.9%. 

Chapter  2  of  this  thesis  considers  exactly  this  question.  Somewhat  surprisingly,  it  turns  out 
that  any  such  weak  learning  algorithm  can  be  efficiently  converted  into  one  whose  error  can 
be  made  arbitrarily  small.  This  result  is  relevant  to  the  design  of  efficient  learning  algorithms 
since,  in  the  future,  to  find  an  extremely  good  learning  algorithm,  it  will  suffice  to  first  design 
one  that  performs  only  slightly  better  than  random  guessing. 
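
The conversion itself is the subject of Chapter 2 and is not described in this overview. The arithmetic below is only meant to give some intuition for why combining several mediocre hypotheses can help. It rests on an idealized assumption that is ours and not a claim about the construction of Chapter 2: three hypotheses each err with probability β, their errors are independent, and their predictions are combined by majority vote. The majority is then wrong exactly when at least two of the three are wrong, which happens with probability 3β²(1 − β) + β³ = 3β² − 2β³.

```python
def majority_error(beta):
    """Error of a majority vote over three hypotheses that each err with
    probability beta, ASSUMING (for this sketch only) independent errors."""
    return 3 * beta**2 - 2 * beta**3

def levels_needed(beta, target, max_levels=100):
    """Apply the three-way majority recursively until the error drops
    below target; returns (number of levels, resulting error)."""
    levels = 0
    while beta >= target and levels < max_levels:
        beta = majority_error(beta)
        levels += 1
    return levels, beta

# A learner that is right 51% of the time has error 0.49.
levels, err = levels_needed(0.49, 0.001)
print(levels, err)
```

Note that an error of exactly 1/2 is a fixed point of this map, so the sketch gives no improvement for pure random guessing; any error strictly below 1/2 shrinks at every level.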

The  result  is  relevant  to  algorithm  design  for  another  reason  as  well:  quite  unexpectedly, 
it  turns  out  that  the  main  result  of  Chapter  2  can  be  applied  to  dramatically  improve  the 
complexity  (i.e.,  the  efficiency)  of  any  PAC-learning  algorithm,  in  several  respects.  Specifically, 
we  show  that  any  PAC-learning  algorithm  can  be  converted  into  one  whose  time  and  sample-size 
requirements  come  close  to  the  best  possible,  and,  even  more  surprisingly,  whose  space  (memory) 
requirements  are  very  modest.  For  example,  the  memory  size  needed  by  this  converted  algorithm 
is  much  less  than  would  be  necessary  to  store  the  entire  sample  (as  is  done  by  many  previous 
PAC-learning  algorithms). 

Learning  Boolean  formulas  from  random  examples 

Thus,  dropping  the  seemingly  strong  requirement  that  arbitrarily  good  error  rates  be  attainable 
does  not  change  what  can  or  cannot  be  learned  in  the  Valiant  model.  However,  restricting  the 
target  distribution,  a  variation  suggested  above,  does  turn  out  to  significantly  affect  what  can 
be  learned.  This  is  the  topic  of  Chapter  3. 

Specifically,  Chapter  3  considers  the  problem  of  learning  read-once  Boolean  formulas  against 
certain  restricted  distributions.  Informally,  a  formula  is  an  expression  that  can  be  written  down 
in  terms  of  variables  and  simple  functions  or  operators.  A  Boolean  formula  is  one  in  which  each 
variable  is  Boolean-valued  (i.e.,  either  true  or  false),  and  the  formula  itself  evaluates  to  a
Boolean  value. 

For  example,  a  car  buzzer  buzzes  if  the  key  is  in  the  ignition  and  the  door  is  open,  or  if 
the  motor  is  on  and  the  seat  belt  is  unfastened.  Let  K,  D,  M  and  S  be  four  Boolean  variables 
which  are  true  if  and  only  if,  respectively,  the  key  is  in  the  ignition,  the  door  is  open,  the  motor 
is  on,  and  the  seat  belt  is  fastened.  Then  the  buzzer  buzzes  if  and  only  if  the  Boolean  formula 

(K  AND  D)  OR  (M  AND  NOT(S))

evaluates  to  true.  Here,  AND,  OR  and  NOT  are  the  standard  logical  Boolean  operators  which
behave  just  as  would  be  expected  (i.e.,  x  AND  y  is  true  if  and  only  if  both  x  and  y  are  true,
and  so  on).  Also,  this  formula  would  be  said  to  be  read-once  since  it  contains  at  most  one 
occurrence  of  each  variable. 
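
The buzzer formula and the read-once property can be checked mechanically. The sketch below uses a representation of our own choosing (a formula is either a variable name or a tuple whose first entry is an operator); nothing about this encoding comes from the thesis.

```python
from itertools import product

# A formula is a variable name (str) or a tuple (operator, subformula, ...).
BUZZER = ("or", ("and", "K", "D"), ("and", "M", ("not", "S")))

def evaluate(f, assignment):
    """Evaluate formula f under a dict mapping variable names to booleans."""
    if isinstance(f, str):
        return assignment[f]
    op, *args = f
    values = [evaluate(a, assignment) for a in args]
    if op == "and":
        return all(values)
    if op == "or":
        return any(values)
    if op == "not":
        return not values[0]
    raise ValueError(f"unknown operator: {op}")

def occurrences(f):
    """List every variable occurrence in f, with repeats."""
    if isinstance(f, str):
        return [f]
    return [v for sub in f[1:] for v in occurrences(sub)]

def is_read_once(f):
    """True if no variable occurs more than once in f."""
    occ = occurrences(f)
    return len(occ) == len(set(occ))

names = ["K", "D", "M", "S"]
assignments = [dict(zip(names, bits)) for bits in product([False, True], repeat=4)]
positives = [a for a in assignments if evaluate(BUZZER, a)]
print(len(positives), "of", len(assignments), "assignments make the buzzer sound")
print("read-once:", is_read_once(BUZZER))
```

In the language of the learning problem, each of the sixteen assignments is an instance, and the ones collected in `positives` are the positive examples.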

Boolean  formulas  can  be  used  to  compute  quite  complicated  functions,  and  it  is  natural  to 
consider  their  learnability.  (Thus,  in  this  learning  problem,  the  domain  is  the  set  of  all  Boolean 
assignments  to  the  variables,  and  an  instance  (an  assignment  to  the  variables)  is  a  positive 
example  if  and  only  if  this  assignment  causes  the  formula  to  evaluate  to  true.)  Unfortunately, 
Boolean  formulas  cannot  be  learned  efficiently  in  the  distribution-free  learning  model  (given 
certain  technical  assumptions),  as  was  proved  by  Kearns  and  Valiant  [49,  52].  In  fact,  their 
result  holds  even  if  a  wide  range  of  restrictions  are  made  on  the  form  of  the  target  formula;  for 
instance,  their  result  holds  even  if  the  formula  is  read-once. 

Since  our  goal  is  to  find  the  most  general  circumstances  in  which  learning  can  take  place, 
it  makes  sense  to  ask  if  this  intractability  result  holds  even  for  restricted  distributions.  For 
instance,  can  read-once  Boolean  formulas  be  learned  efficiently  when  the  target  distribution  is 
uniform  (assigning  equal  probability  to  every  point  in  the  domain)?  One  of  the  main  results  of 
Chapter  3  is  an  affirmative  answer  to  this  question.  In  fact,  we  show  that  such  formulas  can 
be  PAC  learned  against  any  product  distribution  in  which  the  setting  of  each  variable  is  chosen 
independently  of  the  settings  of  the  other  variables. 
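
To illustrate what a product distribution is, the sketch below draws random assignments in which each variable is set independently with its own probability, and labels them with the car-buzzer formula used earlier. The specific probabilities and function names are hypothetical. The uniform distribution is the special case in which every probability equals 1/2.

```python
import random

def buzzer(K, D, M, S):
    """(K AND D) OR (M AND NOT(S))."""
    return (K and D) or (M and not S)

def draw_assignment(probs, rng):
    """Set each variable to True independently with its own probability."""
    return {name: rng.random() < p for name, p in probs.items()}

def estimate_positive_rate(formula, probs, n, rng):
    """Fraction of n random examples that the formula labels positive."""
    hits = sum(bool(formula(**draw_assignment(probs, rng))) for _ in range(n))
    return hits / n

def exact_positive_rate(probs):
    """Exact probability that the buzzer formula is true. The two AND terms
    share no variables (the formula is read-once), so under a product
    distribution they are independent events."""
    a = probs["K"] * probs["D"]
    b = probs["M"] * (1 - probs["S"])
    return a + b - a * b

rng = random.Random(1)
uniform = {"K": 0.5, "D": 0.5, "M": 0.5, "S": 0.5}
skewed = {"K": 0.9, "D": 0.2, "M": 0.3, "S": 0.7}  # hypothetical values
for probs in (uniform, skewed):
    print(exact_positive_rate(probs), estimate_positive_rate(buzzer, probs, 20000, rng))
```

The closed-form calculation in `exact_positive_rate` is possible precisely because of the two properties discussed here: the formula is read-once and the variables are independent.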

This  result  relies  on  a  simple  but  powerful  technique  based  on  sampling  of  the  formula’s
statistical  behavior  under  various  perturbations  of  the  target  distribution.  One  of  the  very  nice 
features  of  this  technique  is  its  robustness  to  noise  or  randomness  that  corrupts  the  learner’s 
observations.  The  algorithm  mentioned  above  for  learning  read-once  Boolean  formulas  can 
handle  a  great  deal  of  randomness  that  affects  the  formula’s  behavior  in  a  variety  of  ways.

This  technique  can  also  be  applied  to  exactly  identify  certain  kinds  of  read-once  formulas 
against  certain  fixed  distributions;  that  is,  the  learning  algorithm  identifies  the  exact  structure 
of  the  target  formula,  and  so  obtains  a  hypothesis  with  100%  accuracy.  Thus,  since  these  same 
classes  of  formulas  are  known  to  be  not  even  weakly  learnable  in  the  distribution-free  model, 
our  results  can  be  interpreted  as  demonstrating  that  while  there  are  some  distributions  which  in
a  computationally  bounded  setting  reveal  essentially  no  information  about  the  target  formula, 
there  are  natural  and  simple  distributions  which  reveal  all  information. 

Dealing  with  uncertainty 

One  rather  unrealistic  aspect  of  the  Valiant  model  is  its  assumption  of  determinism  in  the 
classification  of  instances;  that  is,  we  assume  that  the  target  concept  classifies  every  instance 
in  the  domain  as  either  a  positive  or  a  negative  example.  In  the  real  world,  things  can  be 
(and  usually  are)  much  more  uncertain.  For  example,  returning  to  the  problem  of  learning 
the  concept  of  “being  red,”  there  are  many  objects  in  the  world  which  might  or  might  not  be 
called  red,  depending  on  who  is  asked;  for  instance,  this  would  likely  be  the  case  for  a  crimson 
Harvard  pennant,  or  a  glass  of  Burgundy.  A  good  learning  algorithm  should  be  able  to  deal 
with  the  fact  that  “red”  is  a  concept  with  a  “fuzzy”  boundary,  and  that  some  instances  in  the 
domain  will  sometimes  be  classified  “red,”  and  sometimes  not. 

For  another  example,  consider  the  problem  of  learning  to  predict  whether  or  not  it  will  rain 
tomorrow  based  on  today’s  weather  conditions.  Practically  speaking,  tomorrow’s  weather  is  at 
least  to  some  degree  a  probabilistic  event,  and  the  best  we  can  usually  do  is  to  try  to  predict 
the  probability  of  rain  tomorrow. 

Chapter  4  extends  the  Valiant  model  to  incorporate  such  probabilistic  concepts  (or
p-concepts).  Specifically,  a  p-concept  c  is  a  real-valued  function  that  assigns  to  each  instance
i  a  probability  c(i)  of  being  labeled  positively.  For  instance,  the  p-concept  of  “being  red” 
might  assign  a  very  high  value  (near  1)  to  a  strawberry,  a  very  low  value  (near  0)  to  a  banana, 
and  some  value  in  between  to  a  pomegranate. 
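
A p-concept can be simulated directly: the label of an instance is a coin flip whose bias is given by the p-concept. In the sketch below, the numerical values assigned to each fruit are made up for illustration; only their rough ordering (strawberry high, banana low, pomegranate in between) follows the text.

```python
import random

# Hypothetical values of the p-concept "being red" on three instances.
P_RED = {"strawberry": 0.95, "pomegranate": 0.60, "banana": 0.02}

def c(instance):
    """The p-concept: probability that the instance is labeled positive."""
    return P_RED[instance]

def draw_label(instance, rng):
    """One random label: True ("red") with probability c(instance)."""
    return rng.random() < c(instance)

rng = random.Random(0)
for fruit in P_RED:
    labels = [draw_label(fruit, rng) for _ in range(10000)]
    print(fruit, sum(labels) / len(labels))
```

The learner only ever sees the random labels, never the value c(i) itself; the empirical frequencies printed above approximate c(i) only because the same instance was sampled many times.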

This  chapter  describes  efficient  algorithms  for  learning  various  classes  of  p-concepts.  These 
include  the  class  of  all  nondecreasing  functions  on  the  real  line,  and  a  probabilistic  analog  of  a 
class  of  concepts  introduced  by  Rivest  [72]  called  decision  lists. 

In  addition  to  these  and  other  efficient  algorithms  for  learning  in  the  p-concepts  model,  we 
study  in  detail  the  underlying  theory  of  learning  p-concepts.  For  instance,  we  give  a  technique 
for  testing  the  quality  of  candidate  hypotheses.  That  is,  given  two  hypotheses,  we  give  a
statistical  method  that  can  be  used  to  determine  which  is  better.  For  example,  if  two  meteorologists
apply  for  a  job,  how  can  we  determine  which  makes  better  predictions?  If  one  predicts  that  it 
will  rain  tomorrow  with  70%  probability,  and  the  other  says  the  chance  of  rain  is  85%,  it  is  not 
so  clear  how  to  determine  which  prediction  is  more  accurate  since  tomorrow  it  will  either  rain 
or  it  won’t.  (We  have  no  direct  access  to  what  the  “true”  probability  of  rain  is  tomorrow.)  We 
describe  in  this  chapter  a  technique  for  making  such  a  determination. 
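
The technique of Chapter 4 is not specified in this overview. As a stand-in, the sketch below uses one standard way of scoring probabilistic forecasts, the average quadratic loss (h − y)² between the forecast h and the observed outcome y ∈ {0, 1}; treating this as the comparison method is our assumption. The reason it works is a simple identity: if y is 1 with probability c, then the expected value of (h − y)² equals (h − c)² + c(1 − c), so the forecaster whose predictions are closer to the hidden probabilities has the smaller expected loss, even though c is never observed. The simulated weather data and both forecasters are hypothetical.

```python
import random

def quadratic_loss(forecasts, outcomes):
    """Average of (forecast - outcome)^2, with outcomes in {0, 1}."""
    return sum((h - y) ** 2 for h, y in zip(forecasts, outcomes)) / len(outcomes)

rng = random.Random(0)
days = 5000

# Hidden "true" rain probabilities; used only to simulate the weather.
true_p = [rng.random() for _ in range(days)]
outcomes = [1 if rng.random() < p else 0 for p in true_p]

# Forecaster A is within 0.1 of the truth; forecaster B always says 50%.
careful = [min(1.0, max(0.0, p + rng.uniform(-0.1, 0.1))) for p in true_p]
lazy = [0.5] * days

loss_careful = quadratic_loss(careful, outcomes)
loss_lazy = quadratic_loss(lazy, outcomes)
print("careful:", loss_careful, "lazy:", loss_lazy)
```

The comparison uses only the forecasts and the observed outcomes, which is exactly the information available when judging the two meteorologists.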

We  also  give  a  non-trivial  lower  bound  on  the  number  of  examples  needed  to  learn  in  this
model,  and  we  extend  some  of  the  older  algorithm-design  techniques  from  the  (deterministic)
Valiant  model  to  the  new  p-concept  model.

Figure  2:  A  crossword  puzzle  environment.

Learning  the  structure  of  an  unfamiliar  environment 

Finally,  in  Chapter  5,  we  consider  a  very  different  learning  problem.  This  chapter,  unlike  the 
others,  is  not  concerned  with  concept  learning,  nor  with  learning  from  random  examples.  In  this 
chapter,  we  study  the  problem  of  learning  about  one’s  environment  through  experimentation. 

Imagine  a  robot  that  has  been  placed  in  unfamiliar  surroundings.  For  instance,  the  robot 
might  find  itself  in  the  “crossword  puzzle”  environment  of  Figure  2.  In  such  an  environment, 
the  robot  has  a  limited  number  of  actions,  and  receives  a  very  limited  amount  of  sensory 
information  about  its  environment.  For  instance,  in  this  example  environment,  the  robot  can 
step  forward  one  square,  or  turn  left  or  right  by  90  degrees.  If  the  robot  attempts  to  step 
forward,  but  its  path  is  blocked  by  a  wall  or  one  of  the  black  squares,  then  nothing  happens. 
In  this  environment,  each  wall  has  been  painted  a  different  color,  and  the  robot  can  detect  the 
color  of  the  wall  it  faces;  however,  if  its  view  is  obstructed  by  a  black  square,  then  it  only  sees 
black.  This  is  the  only  sense  data  the  robot  receives  about  its  world. 

We  assume  that  the  robot  knows  nothing  a  priori  about  its  environment,  except  that  the 
environment  is  deterministic  and  finite  state.  The  robot  gathers  information  about  its
environment  by  executing  actions  and  observing  changes  in  the  sensory  data  it  receives.
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We  make  the  realistic  assumption  that  the  robot  has  no  “reset”  or  means  of  bringing  the 
environment  back  to  some  fixed  start  state.  This  distinguishes  our  work  from  much  of  the 
p.evious  research  on  this  problem. 

The  goal  of  the  robot  is  to  discover  the  “structure”  of  its  environment  by  planning  and 
executing  experiments.  In  other  words,  we  ask  that  the  robot  build  a  very  good  model  of  its 
environment  through  experimentation.  As  was  the  case  for  concept  learning,  we  can  judge  the 
quality  of  the  robot’s  model  by  testing  it:  given  a  sequence  of  actions,  the  robot  should  be  able 
to  use  its  model  to  predict  what  sensations  will  be  observed  when  those  actions  are  executed. 
An  action  sequence  that  causes  the  robot’s  model  to  make  an  incorrect  prediction  is  called  a 
counterexample. 

It  turns  out  that  the  robot  cannot  efficiently  build  a  perfect  model  of  its  environment  using 
only  experimentation;  the  reason  is  that  there  are  some  environments  with  hard-to-reach  states 
that  can  never  be  discovered  (efficiently)  simply  through  experimentation.  We  therefore  find 
it  necessary  to  assume  that  the  robot  has  some  source  of  counterexamples.  Thus,  the  robot 
runs  experiments,  builds  a  model  of  its  environment,  and  then  obtains  a  counterexample  to  this 
conjectured  model.  The  robot  repeats  this  process  until  it  converges  to  a  perfect  model.  (This 
source of counterexamples is not as unnatural as it sounds at first blush since, in practice, the
robot  can  often  discover  counterexamples  on  its  own.  For  example,  a  counterexample  can  often 
be  found  by  simply  taking  a  “random  walk”  through  the  environment.) 
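
As an illustration of the random-walk idea, the sketch below searches for a counterexample by executing random actions and comparing the model's predictions with what is actually sensed. It is a simplified sketch, not the procedure analyzed in Chapter 5: the class `FiniteStateEnv`, the table-based encoding, and the assumption that the model starts out synchronized with the robot's current state are all introduced here for illustration.

```python
import random


class FiniteStateEnv:
    """Deterministic finite-state environment (hypothetical minimal encoding)."""

    def __init__(self, transitions, outputs, state):
        self.transitions = transitions   # (state, action) -> next state
        self.outputs = outputs           # state -> sensation
        self.state = state

    def act(self, action):
        self.state = self.transitions[(self.state, action)]
        return self.outputs[self.state]


def random_walk_counterexample(env, model, actions, max_steps, rng):
    """Walk at random; return the action sequence up to the first wrong prediction.

    Returns None if the model predicts every sensation for max_steps actions.
    No reset is used: the walk simply continues from wherever the robot is.
    """
    walk = []
    for _ in range(max_steps):
        action = rng.choice(actions)
        walk.append(action)
        predicted = model.act(action)    # what the conjectured model expects
        observed = env.act(action)       # what the robot actually senses
        if predicted != observed:
            return walk
    return None
```

A return value of `None` does not certify that the model is perfect; it only means that this particular walk found no disagreement.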

In  this  framework,  we  describe  an  efficient  algorithm  that  the  robot  can  use  to  discover 
the  structure  of  its  environment.  This  is  the  first  provably  fast  and  effective  algorithm  for  this 
problem.  We  also  describe  an  algorithm  that  solves  the  same  problem,  but  that  is  based  on  a 
different  representation  of  finite-state  environments,  called  the  diversity-based  representation. 
Finally,  for  a  special  class  of  environments  (called  permutation  environments),  we  also  present 
efficient  algorithms  that  are  effective  even  in  the  absence  of  a  source  of  counterexamples. 

Summary 

In  sum,  this  thesis  extends  the  current  theory  of  machine  learning  in  several  new  directions. 

Chapter  2  extends  our  fundamental  understanding  of  the  PAC  model  by  demonstrating 
an  important  property  of  this  model,  namely,  that  seemingly  weak  learning  algorithms  that 
perform only slightly better than random guessing can be converted into algorithms that
perform extremely well. The result also implies interesting and surprising upper bounds on the
complexity  of  learning  in  the  PAC  model. 

Chapter  3  extends  the  known  techniques  for  efficiently  learning  the  class  of  read-once 
Boolean  formulas.  The  statistical  technique  that  is  presented  is  simple,  powerful,  and  quite 
robust to noise.

Chapter 4 extends the basic PAC model to incorporate randomness or uncertainty that is
almost  certain  to  be  found  in  any  real-world  application  of  learning  technology.  The  chapter 
explores  a  range  of  techniques  for  designing  efficient  learning  algorithms  in  such  a  setting. 

Finally,  Chapter  5  extends  our  ability  to  efficiently  infer  the  structure  of  a  deterministic, 
finite-state  environment  through  experimentation.  This  chapter  includes  algorithms  for  learning 
in  such  environments  even  in  the  absence  of  a  “reset,”  using  either  of  two  representations  for 
finite-state  systems. 


Chapter  2 


The  Strength  of  Weak  Learnability 


2-1  Introduction 

Since  Valiant’s  [83]  pioneering  paper,  interest  has  flourished  in  the  so-called  distribution-free 
or  probably  approximately  correct  (PAC)  model  of  learning.  In  this  model,  the  learner  tries  to 
identify  an  unknown  concept  based  on  randomly  chosen  examples  of  the  concept.  Examples 
are  chosen  according  to  an  unchanging  but  unknown  and  arbitrary  distribution  on  the  space  of 
instances.  The  learner’s  task  is  to  find  a  hypothesis  or  prediction  rule  of  its  own  that  correctly 
classifies  new  instances  as  positive  or  negative  examples  of  the  concept.  With  high  probability, 
the  hypothesis  must  be  correct  for  all  but  an  arbitrarily  small  fraction  of  the  instances. 

Often,  the  inference  task  includes  a  requirement  that  the  output  hypothesis  be  of  a  specified 
form.  However,  in  this  chapter  (and  throughout  most  of  this  thesis)  we  will  instead  be  concerned 
with  a  representation-independent  model  of  learning  in  which  the  learner  may  output  any 
hypothesis  that  can  be  used  to  classify  instances  in  polynomial  time. 

A class of concepts is learnable (or strongly learnable) if there exists a polynomial-time
algorithm  that  achieves  low  error  with  high  confidence  for  all  concepts  in  the  class.  A  weaker 
model  of  learnability,  called  weak  learnability,  drops  the  requirement  that  the  learner  be  able 
to  achieve  arbitrarily  high  accuracy;  a  weak  learning  algorithm  need  only  output  a  hypothesis 
that  performs  slightly  better  (by  an  inverse  polynomial)  than  random  guessing.  The  notion 
of  weak  learnability  was  introduced  by  Kearns  and  Valiant  [52]  who  left  open  the  question  of 
whether  the  notions  of  strong  and  weak  learnability  are  equivalent.  This  question  was  termed 
the  hypothesis  boosting  problem  since  showing  the  notions  are  equivalent  requires  a  method  for 
boosting  the  low  accuracy  of  a  weak  learning  algorithm’s  hypotheses. 

Kearns  [48],  considering  the  hypothesis  boosting  problem,  gives  a  convincing  argument 
discrediting the natural approach of trying to boost the accuracy of a weak learning algorithm
by simply running the procedure many times and taking the “majority vote” of the output
hypotheses.  Also,  Kearns  and  Valiant  [49,  52]  show  that,  under  a  uniform  distribution  on  the 
instance  space,  monotone  Boolean  functions  are  weakly,  but  not  strongly,  learnable.  This  shows 
that  strong  and  weak  learnability  are  not  equivalent  when  certain  restrictions  are  placed  on 
the  instance  space  distribution.  Thus,  it  did  not  seem  implausible  that  the  strong  and  weak 
learning  models  would  prove  to  be  inequivalent  for  unrestricted  distributions  as  well. 

Nevertheless,  in  this  chapter,  the  hypothesis  boosting  question  is  answered  in  the  affirmative. 
The  main  result  is  a  proof  of  the  perhaps  surprising  equivalence  of  strong  and  weak  learnability. 

This  result  may  have  significant  applications  as  a  tool  for  proving  that  a  concept  class  is 
learnable  since,  in  the  future,  it  will  suffice  to  find  an  algorithm  correct  on  only,  say,  51%  of  the 
instances  (for  all  distributions).  Alternatively,  in  its  negative  contrapositive  form,  the  result 
says  that  if  a  concept  class  cannot  be  learned  with  accuracy  99.9%,  then  we  cannot  hope  to  do 
even  slightly  better  than  guessing  on  the  class  (for  some  distribution). 

The proof presented here is constructive; an explicit method is described for directly
converting a weak learning algorithm into one that achieves arbitrary accuracy. The construction
uses  filtering  to  modify  the  distribution  of  examples  in  such  a  way  as  to  force  the  weak  learning 
algorithm  to  focus  on  the  harder-to-learn  parts  of  the  distribution.  Thus,  the  distribution-free 
nature  of  the  learning  model  is  fully  exploited. 

Since  this  result  was  first  published,  Freund  [26]  was  able  to  improve  the  construction 
presented  in  this  chapter.  His  construction  yields  hypotheses  that  are  simpler  in  form  and 
smaller  in  size.  Some  implementations  of  his  procedure  are  also  more  efficient  than  those 
described  here. 

Consequences 

An  immediate  corollary  of  the  main  result  is  the  equivalence  of  strong  and  group  learnability. 
A  group-learning  algorithm  need  only  output  a  hypothesis  capable  of  classifying  large  groups 
of  instances,  all  of  which  are  either  positive  or  negative.  The  notion  of  group  learnability 
was  considered  by  Kearns  et  al.  [51],  and  was  shown  to  be  equivalent  to  weak  learnability  by 
Kearns  and  Valiant  [49,  52].  The  result  also  extends  those  of  Haussler  et  al.  [38]  which  prove 
the  equivalence  of  numerous  relaxations  and  variations  on  the  basic  PAC-learning  model;  both 
weak  and  group  learnability  are  added  to  this  class  of  equivalent  learning  models.  The  relevance 
of  the  main  result  to  a  number  of  other  learning  models  is  also  considered  in  this  chapter. 

An interesting and unexpected consequence of the construction is a proof that any strong
learning algorithm outputting hypotheses whose length (and thus whose time to evaluate)
depends on the allowed error $\epsilon$ can be modified to output hypotheses of length only polynomial
in $\log(1/\epsilon)$. Thus, any learning algorithm can be converted into one whose output hypotheses
do not become significantly more complex as the error tolerance is lowered.

Put  in  other  terms,  this  bound  implies  that  a  sequence  of  labeled  examples  of  a  learnable 
concept can, in a sense, be efficiently “compressed” into a far more compact form, i.e., into a
rule  or  hypothesis  consistent  with  the  labels  of  the  examples.  In  particular,  it  is  shown  that  a 
sample  of  size  m  can  be  compressed  into  a  rule  of  size  only  poly-logarithmic  in  m.  In  fact,  in 
the  discrete  case,  the  size  of  the  output  hypothesis  is  entirely  independent  of  m.  This  provides  a 
partial converse to Occam’s Razor, a result of Blumer et al. [13] stating that the existence of such
a  compression  algorithm  implies  the  learnability  of  the  concept  class.  This  also  complements 
the  results  of  Board  and  Pitt  [15]  who  also  provide  a  partial  converse  to  Occam’s  Razor,  but  of 
a  somewhat  different  flavor.  Finally,  this  result  yields  a  strong  upper  bound  on  the  sample  size 
needed  to  learn  a  discrete  concept  class. 

We  show  that  such  results  also  apply  to  non-discrete  domains.  In  particular,  Littlestone 
and  Warmuth  [59]  describe  a  notion  of  data  compression  in  which  the  output  prediction  rule  is 
represented  by  a  sequence  of  examples  from  the  original  sample.  Such  compression  schemes  are 
also  considered  by  Floyd  [25].  Littlestone  and  Warmuth  show  that  the  existence  of  an  efficient 
compression  scheme  of  this  kind  for  some  concept  class  implies  the  learnability  of  the  class.  In 
this  chapter,  we  prove  the  converse,  showing  that  any  learning  algorithm  can  be  converted  into 
a  compression  scheme.  Thus,  we  prove  a  complete  characterization  of  efficient  PAC  learnability 
in  terms  of  data  compression. 

The  bound  we  prove  on  the  size  of  the  output  hypothesis  also  implies  the  hardness  of 
learning  any  concept  class  not  evaluatable  by  a  family  of  small  circuits.  For  example,  this 
shows that pattern languages (a class of languages considered previously by Angluin [4] and
others) are unlearnable assuming only that NP/poly $\ne$ P/poly. This is the first
representation-independent hardness result not based on cryptographic assumptions. The bound also implies
that,  for  any  function  not  computable  by  polynomial-size  circuits,  there  exists  a  distribution 
on  the  function’s  domain  over  which  the  function  cannot  be  even  roughly  approximated  by  a 
family  of  small  circuits. 

In addition to the bound on hypothesis size, the construction implies a set of general upper
bounds on the dependence on $\epsilon$ of the time, sample and space complexity needed to efficiently
learn any learnable concept class. Most surprising is a proof that there exists for every learnable
concept class an efficient algorithm requiring space only poly-logarithmic in $1/\epsilon$. Because the
size of the sample needed to learn with this accuracy is in general $\Omega(1/\epsilon)$, this means, for
example, that far less space is required to learn than would be necessary to store the entire
sample. Since most of the known learning algorithms work in exactly this manner (i.e., by
storing a large sample and finding a hypothesis consistent with it), this implies a dramatic
savings of memory for a whole class of algorithms (though possibly at the cost of requiring a
larger sample).

Such  general  complexity  bounds  have  implications  for  the  on-line  learning  model  as  well. 
In  this  model,  the  learner  is  presented  one  instance  at  a  time  in  a  series  of  trials.  As  each  is 
received,  the  learner  tries  to  predict  the  true  classification  of  the  new  instance,  attempting  to 
minimize  the  number  of  mistakes,  or  prediction  errors. 

Translating  the  bounds  described  above  into  the  on-line  model,  it  is  shown  that,  for  every 
learnable  concept  class,  there  exists  an  on-line  algorithm  whose  space  requirements  are  quite 
modest  in  comparison  to  the  number  of  examples  seen  so  far.  In  particular,  the  space  needed 
on  the  first  m  trials  is  only  poly-logarithmic  in  m.  Such  space  efficient  on-line  algorithms  are 
of  particular  interest  because  they  capture  the  notion  of  an  incremental  algorithm  forced  by  its 
limited  memory  to  explicitly  generalize  or  abstract  from  the  data  observed.  Also,  these  results 
on  the  space-efficiency  of  batch  and  on-line  algorithms  extend  the  work  of  others  interested  in 
this problem, including Boucheron and Sallantin [18], Floyd [25], and Haussler [34]. In
particular, these results solve an open problem proposed by Haussler, Littlestone and Warmuth [39].

An  interesting  upper  bound  is  also  derived  on  the  expected  number  of  mistakes  made  on 
the  first  m  trials.  It  is  shown  that,  if  a  concept  class  is  learnable,  then  there  exists  an  on-line 
algorithm for the class for which this expectation is bounded by a polynomial in $\log m$. Thus,
for  large  m,  we  expect  an  extremely  small  fraction  of  the  first  m  predictions  to  be  incorrect. 
This  result  answers  another  open  question  given  by  Haussler,  Littlestone  and  Warmuth  [39], 
and  significantly  improves  a  similar  bound  given  in  their  paper  (as  well  as  their  paper  with 
Kearns [38]) of $m^{\alpha}$ for some constant $\alpha < 1$.

2-2  Preliminaries 

We begin with a description of the distribution-free learning model. A concept c is a Boolean
function on some domain of instances. A concept class C is a collection of concepts. Often, C
is decomposed into subclasses $C_n$ indexed by a parameter n. That is, $C = \bigcup_{n \ge 1} C_n$, and all the
concepts in $C_n$ have a common domain $X_n$. We assume each instance in $X_n$ has encoded length
bounded by a polynomial in n, and we let $X = \bigcup_{n \ge 1} X_n$. Also, we associate with each concept
c its size s, typically a measure of the length of c’s representation under some encoding scheme
on the concepts in C.

For example, the concept class C might consist of all functions computed by Boolean formulas.
In this case, $C_n$ is the set of all functions computed by a Boolean formula on n variables,
$X_n$ is the set $\{0, 1\}^n$ of all assignments to the n variables, and the size of a concept c in C is
the length of the shortest Boolean formula that computes the function c.

The  learner  is  assumed  to  have  access  to  a  source  EX  of  examples.  Each  time  oracle  EX 
is called, one instance is randomly and independently chosen from $X_n$ according to some fixed
but unknown and arbitrary distribution D. (More formally, D is a probability measure on a
$\sigma$-algebra of measurable subsets of $X_n$.) The oracle returns the chosen instance x, along with a
label indicating the value c(x) of the instance under the unknown target concept $c \in C_n$. Such
a  labeled  instance  is  called  an  example.  We  assume  EX  runs  in  unit  time. 
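
For a finite instance space, the examples oracle can be sketched in a few lines. This is an illustrative sketch only: the function name `make_example_oracle` and the representation of D as a list of weights are assumptions, and in the learning model itself the learner never sees D or c directly, only the labeled examples that EX returns.

```python
import random


def make_example_oracle(instances, weights, target, rng):
    """Return EX: each call draws x independently according to D and labels it with c(x).

    instances and weights describe a finite distribution D (an assumption of this
    sketch); target plays the role of the unknown concept c.
    """
    def ex():
        x = rng.choices(instances, weights=weights, k=1)[0]
        return x, target(x)
    return ex
```

Because EX is just a zero-argument function returning a labeled example, it can be passed to a learner and wrapped by other procedures without exposing the underlying distribution.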

Given  access  to  EX,  the  learning  algorithm  runs  for  a  time  and  finally  outputs  a  hypothesis 
h,  a  prediction  rule  on  Xn.  In  this  chapter,  we  make  no  restrictions  on  h  other  than  that 
there  exist  a  (possibly  probabilistic)  polynomial-time  algorithm  that,  given  h  and  an  instance 
x, computes h(x), h’s prediction on x.

We write $\Pr_{x \in D}[\pi(x)]$ to indicate the probability of predicate $\pi$ holding on instances x drawn
from $X_n$ according to distribution D. To accommodate probabilistic hypotheses, we will find it
useful to regard $\pi(x)$ as a Bernoulli random variable. For example, $\Pr[h(x) \ne c(x)]$ is the chance
that hypothesis h (which may be randomized) will misclassify some particular instance x. In
contrast, the quantity $\Pr_{x \in D}[h(x) \ne c(x)]$ is the probability that h will misclassify an instance
x chosen at random according to distribution D. Note that this last probability is taken over
both the random choice of x, and any random bits used by h. In general, we have

$$\Pr_{x \in D}[\pi(x)] = \int_{X_n} \Pr[\pi(x)] \, dD(x).$$

The probability $\Pr_{x \in D}[h(x) \ne c(x)]$ is called the error of h on c under D; if the error is
no more than $\epsilon$, then we say h is $\epsilon$-close to the target concept c under D. The quantity
$\Pr_{x \in D}[h(x) = c(x)]$ is the accuracy of h on c under D.
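
When the instance space is finite, the integral defining the error reduces to a weighted sum, which makes the definition easy to check by hand. The sketch below assumes such a finite distribution; the function name `error_of` and the toy distribution, concept, and hypotheses are made up for illustration.

```python
def error_of(dist, mistake_prob):
    """Error of h on c under a finite distribution D.

    dist maps each instance x to D(x); mistake_prob(x) is Pr[h(x) != c(x)],
    taken over the random bits of h (so 0 or 1 when h is deterministic).
    """
    return sum(p * mistake_prob(x) for x, p in dist.items())


D = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
c = lambda x: x >= 2                       # target concept
h = lambda x: x >= 1                       # deterministic hypothesis, wrong only on x = 1
deterministic_error = error_of(D, lambda x: float(h(x) != c(x)))

# A randomized hypothesis that behaves like h but flips a fair coin on x = 3.
randomized_error = error_of(D, lambda x: 0.5 if x == 3 else float(h(x) != c(x)))
```

Here the deterministic hypothesis errs only on the instance of weight 0.2, while the randomized one additionally errs half the time on the instance of weight 0.4, so its error is 0.2 + 0.5 · 0.4 = 0.4.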

We say that a concept class C is learnable, or strongly learnable, if there exists an algorithm
A such that for all $n \ge 1$, for all target concepts $c \in C_n$, for all distributions D on $X_n$, and for
all $0 < \epsilon, \delta < 1$, algorithm A, given parameters n, $\epsilon$, $\delta$, the size s of c, and access to oracle EX,
runs in time polynomial in n, s, $1/\epsilon$ and $1/\delta$, and outputs a hypothesis h that with probability
at least $1 - \delta$ is $\epsilon$-close to c under D. There are many other equivalent notions of learnability,
including  polynomial  predictability  [38].  Also,  note  that  other  authors  have  sometimes  used 
the  term  “learnable”  to  mean  something  slightly  different. 

Kearns and Valiant [52] introduced a weaker form of learnability in which the error $\epsilon$ cannot
necessarily be made arbitrarily small. A concept class C is weakly learnable if there exists a
polynomial p and an algorithm A such that for all $n \ge 1$, for all target concepts $c \in C_n$, for all
distributions D on $X_n$, and for all $0 < \delta < 1$, algorithm A, given parameters n, $\delta$, the size s of
c, and access to oracle EX, runs in time polynomial in n, s and $1/\delta$, and outputs a hypothesis
h that with probability at least $1 - \delta$ is $(1/2 - 1/p(n, s))$-close to c under D. In other words, a weak
learning algorithm produces a prediction rule that performs just slightly better than random
guessing.

2-3 The equivalence of strong and weak learnability

The  main  result  of  this  chapter  is  a  proof  that  learnability  and  weak  learnability  are  equivalent 
notions. 

Theorem 3.1 A concept class C is weakly learnable if and only if it is learnable.

That  strong  learnability  implies  weak  learnability  is  trivial.  The  remainder  of  this  section 
is  devoted  to  a  proof  of  the  converse.  We  assume  then  that  some  concept  class  C  is  weakly 
learnable  and  show  how  to  build  a  strong  learning  algorithm  around  a  weak  one. 

We  begin  with  a  description  of  a  technique  by  which  the  accuracy  of  any  algorithm  can  be 
boosted  by  a  small  but  significant  amount.  Later,  we  will  show  how  this  mechanism  can  be 
applied  recursively  to  make  the  error  arbitrarily  small. 

2-3.1  The  hypothesis  boosting  mechanism 

Let A be an algorithm that produces with high probability a hypothesis $\alpha$-close to the target
concept c. We sketch an algorithm $A'$ that simulates A on three different distributions, and
outputs a hypothesis significantly closer to c.

Let EX be the given examples oracle, and let D be the distribution on $X_n$ induced by EX.
The algorithm $A'$ begins by simulating A on the original distribution $D_1 = D$, using the given
oracle $EX_1 = EX$. Let $h_1$ be the hypothesis output by A.

Intuitively, A has found some weak advantage on the original distribution; this advantage is
expressed by $h_1$. To force A to learn more about the “harder” parts of the distribution, we must
somehow destroy this advantage. To do so, $A'$ creates a new distribution $D_2$ under which an
instance chosen according to $D_2$ has an equal chance of being correctly or incorrectly classified
by $h_1$. The distribution $D_2$ is simulated by filtering the examples chosen according to D by EX.
To simulate $D_2$, a new examples oracle $EX_2$ is constructed. When asked for an instance, $EX_2$
first flips a fair coin: if the result is “heads,” then $EX_2$ requests examples from EX until one
is chosen for which $h_1(x) = c(x)$; otherwise, $EX_2$ waits for an instance to be chosen for which
$h_1(x) \ne c(x)$. (Later we show how to prevent $EX_2$ from having to wait too long in either of
these loops for a desired instance.) The algorithm A is again simulated, this time providing A
with examples chosen by EX_2 according to $D_2$. Let $h_2$ be the resulting output hypothesis.
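
The second oracle is a simple rejection filter wrapped around EX. The following sketch assumes that an examples oracle is a zero-argument function returning a pair (instance, label); the name `make_ex2` and the explicit `rng` argument are illustrative choices, not notation from the text.

```python
import random


def make_ex2(ex, h1, rng):
    """Wrap oracle ex so that h1 is right or wrong on a returned example with equal chance."""
    def ex2():
        want_correct = rng.random() < 0.5      # the fair coin: "heads" means h1 must be right
        while True:
            x, label = ex()                    # label is c(x), supplied by the original oracle
            if (h1(x) == label) == want_correct:
                return x, label
    return ex2
```

Note that the loop may run for a long time if $h_1$ is almost never wrong (or almost never right) under D. This is exactly the waiting-time difficulty that the text defers to later.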

Finally, $D_3$ is constructed by filtering out from D those instances on which $h_1$ and $h_2$ agree.
That is, a third oracle $EX_3$ simulates the choice of an instance according to $D_3$ by requesting
instances from EX until one is found for which $h_1(x) \ne h_2(x)$. (Again, we will later show how
to limit the time spent waiting in this loop for a desired instance.) For a third time, algorithm
A is simulated with examples drawn this time by $EX_3$, producing hypothesis $h_3$.

Figure 1: A graph of the function $g(x) = 3x^2 - 2x^3$.

At last, $A'$ outputs its hypothesis h: given an instance x, if $h_1(x) = h_2(x)$ then h predicts
the agreed upon value; otherwise, h predicts $h_3(x)$. (In other words, h takes the “majority
vote” of $h_1$, $h_2$ and $h_3$.) Later, we show that h’s error is bounded by $g(\alpha) = 3\alpha^2 - 2\alpha^3$. This
quantity is significantly smaller than the original error $\alpha$, as can be seen from its graph depicted
in Figure 1. (The solid curve is the function g, and, for comparison, the dotted line shows a
graph of the identity function.)
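
The third oracle, the final vote, and the bound g can be sketched together. As before, oracles are assumed to be zero-argument functions returning (instance, label) pairs, and the names `make_ex3`, `majority`, and `g` are introduced here only for illustration.

```python
def make_ex3(ex, h1, h2):
    """Wrap oracle ex so that it returns only examples on which h1 and h2 disagree."""
    def ex3():
        while True:                  # may loop for a long time if h1 and h2 rarely disagree
            x, label = ex()
            if h1(x) != h2(x):
                return x, label
    return ex3


def majority(h1, h2, h3):
    """Combined hypothesis: the agreed value of h1 and h2, else h3 breaks the tie."""
    def h(x):
        b1, b2 = h1(x), h2(x)
        return b1 if b1 == b2 else h3(x)
    return h


def g(alpha):
    """Claimed error bound for the combined hypothesis."""
    return 3 * alpha**2 - 2 * alpha**3
```

As a quick numerical check, g(0.4) = 3(0.16) − 2(0.064) = 0.352 and g(0.1) = 0.028, so the improvement is modest when the error is near 1/2 and pronounced when the error is already small. The endpoint g(1/2) = 1/2 is a fixed point, which is why a weak learner must do strictly better than random guessing for the boost to help.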

2-3.2  A  strong  learning  algorithm 

An idea that follows naturally is to treat the previously described procedure as a subroutine
for recursively boosting the accuracy of weaker hypotheses. The procedure is given a desired
error bound $\epsilon$ and a confidence parameter $\delta$, and constructs an $\epsilon$-close hypothesis from weaker,
recursively computed hypotheses. If $\epsilon \ge 1/2 - 1/p(n, s)$ then an assumed weak learning algorithm can
be used to find the desired hypothesis; otherwise, an $\epsilon$-close hypothesis is computed recursively
by calling the subroutine with $\epsilon$ set to $g^{-1}(\epsilon)$.

Learn($\epsilon$, $\delta$, EX)

Input:  error parameter $\epsilon$
        confidence parameter $\delta$
        examples oracle EX
        (implicit) size parameters s and n

Return: hypothesis h that is $\epsilon$-close to the target concept c with probability $\ge 1 - \delta$

Procedure:
    if $\epsilon \ge 1/2 - 1/p(n, s)$ then return WeakLearn($\delta$, EX)
    $\alpha \leftarrow g^{-1}(\epsilon)$
    $EX_1 \leftarrow EX$
    $h_1 \leftarrow$ Learn($\alpha$, $\delta/5$, $EX_1$)
    $\tau_1 \leftarrow \epsilon/3$
    let $\hat{a}_1$ be an estimate of $a_1 = \Pr_{x \in D}[h_1(x) \ne c(x)]$:
        choose a sample sufficiently large that $|a_1 - \hat{a}_1| \le \tau_1$ with probability $\ge 1 - \delta/5$
    if $\hat{a}_1 \le \epsilon - \tau_1$ then return $h_1$
    defun $EX_2$()
        { flip coin
          if “heads,” return the first instance x from EX for which $h_1(x) = c(x)$
          else return the first instance x from EX for which $h_1(x) \ne c(x)$ }
    $h_2 \leftarrow$ Learn($\alpha$, $\delta/5$, $EX_2$)
    $\tau_2 \leftarrow (1 - 2\alpha)\epsilon/8$
    let $\hat{e}$ be an estimate of $e = \Pr_{x \in D}[h_2(x) \ne c(x)]$:
        choose a sample sufficiently large that $|e - \hat{e}| \le \tau_2$ with probability $\ge 1 - \delta/5$
    if $\hat{e} \le \epsilon - \tau_2$ then return $h_2$
    defun $EX_3$()
        { return the first instance x from EX for which $h_1(x) \ne h_2(x)$ }
    $h_3 \leftarrow$ Learn($\alpha$, $\delta/5$, $EX_3$)
    defun h(x)
        { $b_1 \leftarrow h_1(x)$, $b_2 \leftarrow h_2(x)$
          if $b_1 = b_2$ then return $b_1$
          else return $h_3(x)$ }
    return h

Figure 2: A strong learning algorithm Learn.

Unfortunately, this scheme by itself does not quite work due to a technical difficulty alluded
to above: because of the way $EX_2$ and $EX_3$ are constructed, examples may be required from
a very small portion of the original distribution. If this happens, the time spent waiting for
an example to be chosen from this region may be great. Nevertheless, we will see that this
difficulty can be overcome by explicitly checking that the errors of hypotheses $h_1$ and $h_2$ on D
are not too small.

Figure 2 shows a detailed sketch of the resulting strong learning algorithm Learn. The
procedure takes an error parameter $\epsilon$ and a confidence parameter $\delta$, and is also provided with
an examples oracle EX. The procedure is required to return a hypothesis whose error is at
most $\epsilon$ with probability at least $1 - \delta$. In the figure, p is a polynomial and WeakLearn($\delta$, EX) is
an assumed weak learning procedure that outputs a hypothesis $(1/2 - 1/p(n, s))$-close to the target
concept c with probability at least $1 - \delta$. As above, $g(\alpha)$ is the function $3\alpha^2 - 2\alpha^3$, and the
variable $\alpha$ is set to the value $g^{-1}(\epsilon)$. Also, the quantities $\hat{a}_1$ and $\hat{e}$ are estimates of the errors of
$h_1$ and $h_2$ under the given distribution D. These estimates are made with error tolerances $\tau_1$
and $\tau_2$ (defined in the figure), and are computed in the obvious manner based on samples drawn
from EX; the required size of these samples can be determined, for instance, using Chernoff
bounds. The parameters s and n are assumed to be known globally.
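
One standard way to size such a sample is Hoeffding's inequality, a bound of the Chernoff type: with m independent examples, the empirical error deviates from the true error by more than $\tau$ with probability at most $2\exp(-2m\tau^2)$. The sketch below uses that particular inequality as an assumption; the text does not commit to a specific bound, and the function names are illustrative.

```python
import math


def sample_size(tau, delta):
    """Smallest m with 2 * exp(-2 * m * tau**2) <= delta (Hoeffding's inequality)."""
    return math.ceil(math.log(2 / delta) / (2 * tau**2))


def estimate_error(h, ex, tau, delta):
    """Empirical error of h on fresh examples from ex.

    Under the Hoeffding assumption above, the result is within tau of the
    true error with probability at least 1 - delta.
    """
    m = sample_size(tau, delta)
    mistakes = 0
    for _ in range(m):
        x, label = ex()
        mistakes += (h(x) != label)
    return mistakes / m
```

The sample size grows like $1/\tau^2$ but only logarithmically in $1/\delta$, which is why splitting the confidence parameter into fifths on each recursive call is affordable.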

Note  that  Learn  is  a  procedure  taking  as  one  of  its  inputs  a  function  (EX)  and  returning  as 
output  another  function  (h,  a  hypothesis,  which  is  treated  like  a  procedure).  Furthermore,  to 
simulate new example oracles, Learn must have a means of dynamically defining new procedures
(as  is  allowed,  for  instance,  by  most  Lisp-like  languages).  Therefore,  in  the  figure,  we  have  used 
the  somewhat  non-standard  keyword  defun  to  denote  the  definition  of  a  new  function;  its 
syntax  calls  for  a  name  for  the  procedure,  followed  by  a  parenthesized  list  of  arguments,  and 
the  body  indented  in  braces.  Static  scoping  is  assumed. 

Learn  works  by  recursively  boosting  the  accuracy  of  its  hypotheses.  Learn  typically  calls 
itself  three  times  using  the  three  simulated  example  oracles  described  in  the  preceding  section. 
On each recursive call, the required error bound of the constructed hypotheses comes closer to
1/2; when this bound reaches $1/2 - 1/p(n, s)$, the weak learning algorithm WeakLearn can be used.
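
To see how quickly the error targets approach 1/2, one can invert g numerically and list the successive targets along one branch of the recursion. The sketch below is illustrative only: `g_inverse` uses bisection (g is increasing on [0, 1/2]), and `weak_error` is a hypothetical stand-in for the weak learner's guaranteed error $1/2 - 1/p(n, s)$.

```python
def g(alpha):
    return 3 * alpha**2 - 2 * alpha**3


def g_inverse(eps, iterations=60):
    """Solve g(alpha) = eps for alpha in [0, 1/2] by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) < eps:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def error_targets(eps, weak_error):
    """Error bounds requested at successive recursion depths, for 0 < eps < 1/2.

    The list stops at the first target that the weak learner can meet directly.
    """
    targets = [eps]
    while targets[-1] < weak_error:
        targets.append(g_inverse(targets[-1]))
    return targets
```

For instance, `error_targets(0.01, 0.45)` produces a short increasing list that starts at 0.01 and ends at the first value of at least 0.45; its length minus one is the recursion depth along that branch.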

The procedure takes measures to limit the run time of the simulated oracles it provides
on recursive calls. When Learn calls itself a second time to find $h_2$, the expected number of
iterations of $EX_2$ to find an example depends on the error of $h_1$, which is estimated by $\hat{a}_1$. If
$h_1$ already has the desired accuracy $1 - \epsilon$, then there is no need to find $h_2$ and $h_3$ since $h_1$ is
a sufficiently good hypothesis; otherwise, if $a_1 = \Omega(\epsilon)$, then it can be shown that $EX_2$ will not
loop too long to find an instance. Similarly, when Learn calls itself to find $h_3$, the expected
number of iterations of $EX_3$ depends on how often $h_1$ and $h_2$ disagree, which we will see is
in turn a function of the error of $h_2$ on the original distribution D. If this error e (which is
estimated by $\hat{e}$) is small, then $h_2$ is a good hypothesis and is returned by Learn. Otherwise, it
will be shown that $EX_3$ also will not run for too long.

2-3.3 Correctness

We  show  in  this  section  that  the  algorithm  is  correct  in  the  following  sense: 

Theorem 3.2 For $0 < \epsilon < 1/2$ and for $0 < \delta < 1$, the hypothesis returned by calling
Learn($\epsilon$, $\delta$, EX) is $\epsilon$-close to the target concept with probability at least $1 - \delta$.

Proof: In proving this theorem, we will find it useful to assume that nothing “goes wrong”
throughout the execution of Learn. More specifically, we will say that Learn has a good run
if every hypothesis returned by WeakLearn is indeed $(1/2 - 1/p(n, s))$-close to the target concept,
and if every statistical estimate (i.e., of the quantities $a_1$ and e) is obtained with the required
accuracy. We will then argue inductively on the depth of the recursion that if Learn has a
good run then the output hypothesis is $\epsilon$-close to the target concept, and furthermore, that the
probability of a good run is at least $1 - \delta$. Together, these facts clearly imply the theorem’s
statement.

The base case that $\epsilon \ge 1/2 - 1/p(n,s)$ is trivially handled using our assumptions about WeakLearn.

In the general case, by inductive hypothesis, each of the three (or fewer) recursive calls to
Learn is a good run with probability at least $1 - \delta/5$. Moreover, each of the estimates $\hat{a}_1$ and
$\hat{e}$ has the desired accuracy with probability at least $1 - \delta/5$. Thus, the chance of a good run is
at least the chance that all five of these events occur, which is at least $1 - \delta$.

It remains then only to show that on a good run the output hypothesis has error at most $\epsilon$.

An easy special case is that $\hat{a}_1$ or $\hat{e}$ is found to be smaller than $\epsilon - \tau_1$ or $\epsilon - \tau_2$, respectively.
In either case, it follows immediately, due to the accuracy with which $a_1$ and $e$ are assumed
to have been estimated, that the returned hypothesis is $\epsilon$-close to the target concept. (For
instance, if $\hat{e} \le \epsilon - \tau_2$, then $e = \Pr_{x \in D}[h_2(x) \ne c(x)] \le \epsilon$, and thus the returned hypothesis $h_2$
is $\epsilon$-close to $c$.)

Otherwise, in the general case, all three subhypotheses must be found and combined. Let
$a_i$ be the error of $h_i$ under $D_i$. Here, $D$ is the distribution of the provided oracle EX, and $D_i$
is the distribution induced by oracle $\mathit{EX}_i$ on the $i$th recursive call ($i = 1, 2, 3$). By inductive
hypothesis, each $a_i \le \alpha$.

In the special case that all hypotheses are deterministic, the distributions $D_1$ and $D_2$ can be
depicted schematically as shown in Figure 3. The figure shows the portion of each distribution
for which the hypotheses $h_1$ and $h_2$ agree with the target concept $c$. For each distribution, the
top crosshatched bar represents the relative fraction of the instance space for which $h_1$ agrees
with $c$; the bottom striped bar represents those instances for which $h_2$ agrees with $c$. Although
only valid for deterministic hypotheses, this figure may be helpful for motivating one's intuition
in what follows.

[Figure 3: The distributions $D_1$ and $D_2$. Bars labeled $h_1 = c$ and $h_2 = c$.]

Let $p_i(x) = \Pr[h_i(x) \ne c(x)]$ be the chance that some fixed instance $x$ is misclassified
by $h_i$. (Recall that hypotheses may be randomized, and therefore it is necessary to consider
the probability that a particular fixed instance is misclassified.) Thus, $a_i = \int_{X_n} p_i(x)\,dD_i(x)$.
Similarly, let $q(x) = \Pr[h_1(x) \ne h_2(x)]$ be the chance that $x$ is classified differently by $h_1$ and $h_2$.
Also define $w_{ij}$ as follows:

\begin{align*}
w_{00} &= \Pr_{x \in D}[(h_1(x) \ne c(x)) \land (h_2(x) \ne c(x))] \\
w_{01} &= \Pr_{x \in D}[(h_1(x) \ne c(x)) \land (h_2(x) = c(x))] \\
w_{10} &= \Pr_{x \in D}[(h_1(x) = c(x)) \land (h_2(x) \ne c(x))] \\
w_{11} &= \Pr_{x \in D}[(h_1(x) = c(x)) \land (h_2(x) = c(x))]
\end{align*}

Note that the first subscript determines whether $h_1$ is correct, and the second subscript whether
$h_2$ is correct. Thus, for $i, j \in \{0, 1\}$, we have

\[ w_{ij} = \int_{X_n} |i - p_1(x)| \cdot |j - p_2(x)| \, dD(x). \]


Clearly,

\[ w_{00} + w_{01} = \Pr_{x \in D}[h_1(x) \ne c(x)] = a_1, \tag{3.1} \]

and also

\[ w_{00} + w_{01} + w_{10} + w_{11} = 1. \tag{3.2} \]

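In terms of these variables, we can express explicitly the chance that $\mathit{EX}_i$ returns an instance
from any measurable set $A \subseteq X_n$:

\begin{align*}
D_1(A) &= D(A) \tag{3.3} \\
D_2(A) &= \tfrac{1}{2}\Pr_{x \in D}[x \in A \mid h_1(x) \ne c(x)] + \tfrac{1}{2}\Pr_{x \in D}[x \in A \mid h_1(x) = c(x)] \\
       &= \int_A \left( \frac{p_1(x)}{2a_1} + \frac{1 - p_1(x)}{2(1 - a_1)} \right) dD(x) \tag{3.4} \\
D_3(A) &= \Pr_{x \in D}[x \in A \mid h_1(x) \ne h_2(x)] \\
       &= \int_A \frac{q(x)}{w_{01} + w_{10}} \, dD(x). \tag{3.5}
\end{align*}

In (3.5), we have used the fact that $\Pr_{x \in D}[h_1(x) \ne h_2(x)] = w_{01} + w_{10}$ since $c$, $h_1$ and $h_2$ are
Boolean valued.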

From equation (3.4), we have that

\begin{align*}
1 - a_2 &= \int_{X_n} (1 - p_2(x)) \, dD_2(x) \\
        &= \int_{X_n} (1 - p_2(x)) \left( \frac{p_1(x)}{2a_1} + \frac{1 - p_1(x)}{2(1 - a_1)} \right) dD(x) \\
        &= \frac{1}{2a_1} \int_{X_n} p_1(x)(1 - p_2(x)) \, dD(x) + \frac{1}{2(1 - a_1)} \int_{X_n} (1 - p_1(x))(1 - p_2(x)) \, dD(x) \\
        &= \frac{w_{01}}{2a_1} + \frac{w_{11}}{2(1 - a_1)}. \tag{3.6}
\end{align*}

To see that the second equality follows from (3.4), see, for instance, Theorem 16.10 of Billingsley [12].
Also, note that (3.6) could have been derived from Figure 3 in the case of deterministic
hypotheses: if $\beta$ is as shown in the figure, then it is not hard to see that $w_{01} = 2a_1\beta$ and
$w_{11} = 2(1 - a_1)(1 - a_2 - \beta)$. These equalities imply (3.6).

Combining equations (3.1), (3.2) and (3.6), we see that the values of $w_{10}$ and $w_{00}$ can be
solved for and written explicitly in terms of $w_{01}$, $a_1$ and $a_2$:

\begin{align*}
w_{10} &= 1 - a_1 - w_{11} \\
       &= 1 - a_1 - 2(1 - a_1)(1 - a_2) + \frac{w_{01}(1 - a_1)}{a_1} \\
       &= (2a_2 - 1)(1 - a_1) + \frac{w_{01}(1 - a_1)}{a_1} \tag{3.7} \\
w_{00} &= a_1 - w_{01}. \tag{3.8}
\end{align*}

We are finally ready to compute the error of the output hypothesis $h$. Recall that $h$ is
incorrect if and only if $h_1$ and $h_2$ agree in their predictions but are incorrect, or if $h_1$ and $h_2$
disagree and $h_3$ is incorrect. Thus the error of $h$ is

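\begin{align*}
\Pr_{x \in D}[h(x) \ne c(x)] &= \Pr_{x \in D}[(h_1(x) = h_2(x) \ne c(x)) \lor (h_1(x) \ne h_2(x) \land h_3(x) \ne c(x))] \\
  &= \Pr_{x \in D}[h_1(x) \ne c(x) \land h_2(x) \ne c(x)] + \Pr_{x \in D}[h_1(x) \ne h_2(x) \land h_3(x) \ne c(x)] \\
  &= w_{00} + \int_{X_n} q(x) p_3(x) \, dD(x).
\end{align*}

By equation (3.5), this is equal to

\begin{align*}
w_{00} + \int_{X_n} (w_{10} + w_{01}) p_3(x) \, dD_3(x) &= w_{00} + a_3 (w_{10} + w_{01}) \\
  &\le w_{00} + \alpha (w_{10} + w_{01}).
\end{align*}

Applying equations (3.7) and (3.8), this equals

\begin{align*}
\alpha(2a_2 - 1)(1 - a_1) + a_1 + \frac{w_{01}(\alpha - a_1)}{a_1} &\le \alpha(2a_2 - 1)(1 - a_1) + \alpha \\
  &\le \alpha(2\alpha - 1)(1 - \alpha) + \alpha = 3\alpha^2 - 2\alpha^3 = g(\alpha) = \epsilon
\end{align*}

as desired. The inequalities here follow from the facts that each $a_i \le \alpha < 1/2$, and that, by
equation (3.1), $w_{01} \le a_1$.

This completes the proof. ■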
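The final chain of inequalities can be checked numerically. The sketch below is only an illustration: the function names are ours, and it assumes that $w_{01}$ lies in the feasible range $a_1(1 - 2a_2) \le w_{01} \le a_1$ (the range in which $w_{10}$ and $w_{00}$, as given by (3.7) and (3.8), are nonnegative).

```python
def g(x: float) -> float:
    """g(x) = 3x^2 - 2x^3, the error bound for the combined hypothesis."""
    return 3 * x**2 - 2 * x**3


def combined_error_bound(a1: float, a2: float, a3: float, w01: float) -> float:
    """Return w00 + a3 * (w10 + w01), with w10 and w00 taken from (3.7) and (3.8).

    Assumes 0 < a1 < 1/2 and a1 * (1 - 2 * a2) <= w01 <= a1 (illustrative
    feasibility conditions, not checked here).
    """
    w10 = (2 * a2 - 1) * (1 - a1) + w01 * (1 - a1) / a1  # equation (3.7)
    w00 = a1 - w01                                        # equation (3.8)
    return w00 + a3 * (w10 + w01)
```

With $a_1 = a_2 = a_3 = \alpha$ the bound is met with equality, which suggests that $g(\alpha)$ cannot be improved by this argument alone.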

A footnote on measurability: For any hypothesis $h$ output by WeakLearn, we have assumed
implicitly that the function $f_h(x) = \Pr[h(x) \ne c(x)]$ is measurable. (If this were not the case,
then it would hardly make sense to discuss the error of $h$, since the error is just the expected
value of $f_h$.) By induction, the same can be proved about hypotheses output by Learn: Briefly,
each function $p_i$ used in the proof above is $D_i$-measurable by inductive hypothesis. Thus,
by (3.4) and (3.5), $p_2$ and $q p_3$ are $D$-measurable (as follows, for instance, from Billingsley [12]
Theorem 16.10). Therefore, the error function $f_h = p_1 p_2 + q p_3$ is also $D$-measurable, where $h$ is
the output hypothesis. Note that these facts imply that all the integrals used in the preceding
proof are defined.

2-3.4  Analysis 

In this section, we argue that Learn runs in polynomial time. Here and throughout this section,
unless stated otherwise, polynomial refers to polynomial in $n$, $s$, $1/\epsilon$ and $1/\delta$. Our approach will
be to first derive a bound on the expected running time of the procedure, and to then use a part
of the confidence $\delta$ to bound with high probability the actual running time of the algorithm.
Thus, we will have shown that the procedure is probably fast and correct, completing the
proof of Theorem 3.1. (Although technically we only show that Learn halts probabilistically,
using techniques described by Haussler et al. [38], the procedure can easily be converted into a
learning algorithm that halts deterministically in polynomial time.)

We will be interested in bounding several quantities. First, we are of course interested
in bounding the expected running time $T(\epsilon, \delta)$ of Learn($\epsilon$, $\delta$, EX). This running time in turn
depends on the time $U(\epsilon, \delta)$ to evaluate a hypothesis returned by Learn, and on the expected
number of examples $M(\epsilon, \delta)$ needed by Learn. In addition, let $t(\delta)$, $u(\delta)$ and $m(\delta)$ be analogous
quantities for WeakLearn($\delta$, EX). By assumption, $t$, $u$ and $m$ are polynomially bounded. (All
of these functions also depend implicitly on $n$ and $s$.)

As  a  technical  point,  we  note  that  the  expectations  denoted  by  T  and  M  are  taken  only 
over  good  runs  of  Learn.  That  is,  the  expectations  are  computed  given  the  assumption  that 
every  subhypothesis  and  every  estimator  is  successfully  computed  with  the  desired  accuracy. 
By Theorem 3.2, Learn will have a good run with probability at least $1 - \delta$.

It is also important to point out that $T$ (respectively, $t$) is the expected running time of Learn
(WeakLearn) when called with an oracle EX that provides examples in unit time. Our analysis
will take into account the fact that the simulated oracles supplied to Learn or WeakLearn at
lower levels of the recursion do not in general run in unit time.

We will see that $T$, $U$ and $M$ are all exponential in the depth of the recursion induced by
calling Learn. We therefore begin by bounding this depth. Let $B(\epsilon, p)$ be the smallest integer
$i$ for which $g^i(1/2 - 1/p) \le \epsilon$. On each recursive call, $\epsilon$ is replaced by $g^{-1}(\epsilon)$. Thus, the depth of
the recursion is bounded by $B(\epsilon, p(n,s))$. We have:

Lemma 3.3 The depth of the recursion induced by calling Learn($\epsilon$, $\delta$, EX) is at most
$B(\epsilon, p(n,s)) = O(\log(p(n,s)) + \log\log(1/\epsilon))$.

Proof: We can say $B(\epsilon, p(n,s)) \le b + c$ if $g^b(1/2 - 1/p(n,s)) \le 1/4$ and $g^c(1/4) \le \epsilon$. Clearly,
$g(x) \le 3x^2$ for $x \ge 0$, and so, by an easy induction on $i$, $g^i(x) \le (3x)^{2^i}/3$. Thus, $g^c(1/4) \le \epsilon$ if
$c = \lceil \lg \log_{4/3}(1/(3\epsilon)) \rceil$.

Similarly, if $1/4 \le x \le 1/2$ then $1/2 - g(x) = (1/2 - x)(1 + 2x - 2x^2) \ge (11/8)(1/2 - x)$.
This implies (by induction on $i$) that $1/2 - g^i(x) \ge (11/8)^i (1/2 - x)$, assuming that
$x, g(x), \ldots, g^{i-1}(x)$ are all at least $1/4$. Thus, $g^b(1/2 - 1/p(n,s)) \le 1/4$ if $b = \lceil \log_{11/8}(p(n,s)/4) \rceil$.

■ 
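The bound of Lemma 3.3 can be compared against the exact depth obtained by iterating $g$. The following sketch is illustrative only; the function names are ours, and the guards for $p \le 4$ and $\epsilon \ge 1/4$ (where $b$ or $c$ can be taken to be zero) are our own handling of edge cases.

```python
import math


def g(x: float) -> float:
    """g(x) = 3x^2 - 2x^3."""
    return 3 * x**2 - 2 * x**3


def recursion_depth(eps: float, p: float) -> int:
    """Smallest i with g^i(1/2 - 1/p) <= eps, i.e. B(eps, p)."""
    x = 0.5 - 1.0 / p
    depth = 0
    while x > eps:
        x = g(x)
        depth += 1
    return depth


def depth_bound(eps: float, p: float) -> int:
    """The quantity b + c from the proof of Lemma 3.3."""
    # b iterations bring the error from 1/2 - 1/p down to at most 1/4.
    b = math.ceil(math.log(p / 4, 11 / 8)) if p > 4 else 0
    # c further iterations bring it from 1/4 down to at most eps.
    c = math.ceil(math.log2(math.log(1 / (3 * eps), 4 / 3))) if eps < 0.25 else 0
    return b + c
```

For example, with $\epsilon = 0.01$ and $p = 10$ the exact depth is 6, while the proof's estimate gives $b + c = 7$.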

For the remainder of this analysis, we let $p = p(n,s)$ and, where clear from context, let
$B = B(\epsilon, p)$. Note that $B(g^{-1}(\epsilon), p) = B - 1$ for $\epsilon < 1/2 - 1/p$.

We show next that $U$ is polynomially bounded. This is important because we require that
the returned hypothesis be polynomially evaluatable.

Lemma 3.4 The time to evaluate a hypothesis returned by Learn($\epsilon$, $\delta$, EX) is
$U(\epsilon, \delta) = O(3^B \cdot u(\delta/5^B))$.

Proof: If $\epsilon \ge 1/2 - 1/p$ then Learn returns a hypothesis computed by WeakLearn. In this case,
$U(\epsilon, \delta) = u(\delta)$. Otherwise, the hypothesis returned by Learn involves the computation of at
most three subhypotheses. Thus,

\[ U(\epsilon, \delta) \le 3 \cdot U(g^{-1}(\epsilon), \delta/5) + c \]

for some positive constant $c$. A straightforward induction argument shows that this recurrence
implies the bound

\[ U(\epsilon, \delta) \le 3^B u(\delta/5^B) + c(3^B - 1). \]

■ 
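The induction in the proof of Lemma 3.4 can be illustrated by unrolling the recurrence directly. This is only a sketch under a simplifying assumption of ours: the evaluation time of a weak hypothesis is treated as a fixed number `u`, ignoring its dependence on the confidence parameter.

```python
def unrolled_eval_time(depth: int, u: float, c: float) -> float:
    """Worst case of the recurrence U(B) = 3 * U(B - 1) + c with U(0) = u.

    `u` (cost of one weak hypothesis) and `c` (combining overhead) are
    illustrative constants, not quantities fixed by the text.
    """
    cost = u
    for _ in range(depth):
        cost = 3 * cost + c
    return cost
```

Unrolling gives exactly $3^B u + c(3^B - 1)/2$, which is within the stated bound $3^B u + c(3^B - 1)$.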

When an example is requested of a simulated oracle on one of Learn's recursive calls, that
oracle must itself draw several examples from its own oracle EX. For instance, on the third
recursive call, the simulated oracle must draw instances until it finds one on which $h_1$ and $h_2$
disagree. Naturally, the running time of Learn depends on how many examples must be drawn
in this manner by the simulated oracle. The next lemma bounds this quantity.
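The simulated oracles are rejection samplers over the underlying oracle. The sketch below is a minimal illustration, assuming an oracle is a zero-argument function returning an (instance, label) pair and a hypothesis is a function from instances to Boolean labels; these interfaces and the names `make_ex2` and `make_ex3` are ours. The second oracle flips a fair coin and then waits for an instance that $h_1$ classifies correctly or incorrectly accordingly; the third waits for an instance on which $h_1$ and $h_2$ disagree.

```python
import random
from typing import Callable, Tuple

Example = Tuple[int, bool]            # (instance, correct label)
Oracle = Callable[[], Example]
Hypothesis = Callable[[int], bool]


def make_ex2(ex: Oracle, h1: Hypothesis, rng: random.Random) -> Oracle:
    """Oracle under which h1 is right or wrong with probability 1/2 each."""
    def ex2() -> Example:
        want_correct = rng.random() < 0.5      # fair coin flip
        while True:                            # loop until a desirable example
            x, label = ex()
            if (h1(x) == label) == want_correct:
                return x, label
    return ex2


def make_ex3(ex: Oracle, h1: Hypothesis, h2: Hypothesis) -> Oracle:
    """Oracle returning only instances on which h1 and h2 disagree."""
    def ex3() -> Example:
        while True:
            x, label = ex()
            if h1(x) != h2(x):
                return x, label
    return ex3
```

The number of iterations of each `while` loop is the quantity that the next lemma bounds in expectation.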

Lemma 3.5 Let $r$ be the expected number of examples drawn from EX by any oracle $\mathit{EX}_i$
simulated by Learn on a good run when asked to provide a single example. Then $r \le 4/\epsilon$.

Proof: When Learn calls itself the first time (to find $h_1$), the examples oracle EX it was passed
is left unchanged. In this case, $r = 1$.

The second time Learn calls itself, the constructed oracle $\mathit{EX}_2$ loops each time it is called
until it receives a desirable example. Depending on the result of the initial coin flip, we expect
$\mathit{EX}_2$ to loop $1/a_1$ or $1/(1 - a_1)$ times. Note that if $a_1 < \epsilon - 2\tau_1 = \epsilon/3$ then, based on its estimate
of $a_1$, Learn would have simply returned $h_1$ instead of making a second or third recursive call.
Thus, we can assume $\epsilon/3 \le a_1 \le 1/2$, and so $r \le 3/\epsilon$ in this case.

Finally, when Learn calls itself the third time, we expect the constructed oracle $\mathit{EX}_3$ to loop
$1/(w_{10} + w_{01})$ times before finding a suitable example, since $w_{10} + w_{01}$ is exactly the chance that
$h_1$ and $h_2$ disagree in their classifications. (Here, the variables $w_{ij}$ are as defined in the proof
of Theorem 3.2.) It remains then only to show that $w_{10} + w_{01} \ge \epsilon/4$.

Note that the error $e$ of $h_2$ on the original distribution $D$ is $w_{10} + w_{00}$. Thus, using this fact
and equations (3.1), (3.2) and (3.6), we can solve explicitly for $w_{10} + w_{01}$ in terms of $e$, $a_1$ and
$a_2$. Specifically, (3.1) combined with the fact that $e = w_{10} + w_{00}$ gives

\[ w_{10} - w_{01} = e - a_1, \tag{3.9} \]

and (3.1), (3.2) and (3.6) together imply that

\[ 1 - a_2 = \frac{w_{01}}{2a_1} + \frac{1 - a_1 - w_{10}}{2(1 - a_1)}, \]

or equivalently,

\[ 2(1 - a_1) w_{01} - 2a_1 w_{10} = 2a_1(1 - a_1)(1 - 2a_2). \]

Combined with (3.9), this implies

\[ (1 - 2a_1)(w_{01} + w_{10}) = e - a_1 + 2a_1(1 - a_1)(1 - 2a_2) \]

and so

\begin{align*}
w_{01} + w_{10} &= \frac{e - a_1 + 2a_1(1 - a_1)(1 - 2a_2)}{1 - 2a_1} \\
  &= a_1 + \frac{e - 4a_1 a_2 (1 - a_1)}{1 - 2a_1} \\
  &\ge a_1 + \frac{e - 4a_1 \alpha (1 - a_1)}{1 - 2a_1}. \tag{3.10}
\end{align*}

Regarding $e$ and $\alpha < 1/2$ as fixed, we refer to this last function on the right hand side of the
inequality as $f(a_1)$. To lower bound $w_{10} + w_{01}$, we find the minimum of $f$ on the interval $[0, \alpha]$.

The derivative of $f$ is:

\[ f'(a_1) = \frac{(4 - 8\alpha)a_1^2 - (4 - 8\alpha)a_1 + (1 - 4\alpha + 2e)}{(1 - 2a_1)^2}. \]

The denominator of this derivative is clearly zero only when $a_1 = 1/2$, and the numerator,
being a parabola centered about the line $a_1 = 1/2$, has at most one zero less than $1/2$. Thus,
the function $f$ has at most one critical point on the interval $(-\infty, 1/2)$. Furthermore, since $f$
tends to $-\infty$ as $a_1 \to -\infty$, a single critical point in this range cannot possibly be minimal. This
means that $f$'s minimum on any closed subinterval of $(-\infty, 1/2)$ is achieved at one endpoint
of the subinterval. In particular, for the subinterval of interest to us, the function achieves its
minimum either when $a_1 = 0$ or when $a_1 = \alpha$. Thus, $w_{10} + w_{01} \ge \min(f(0), f(\alpha))$.

We can assume that $e \ge \epsilon - 2\tau_2 = (3/4 + \alpha/2)\epsilon$; otherwise, if $e$ were smaller than this
quantity, then $\hat{e}$ would be less than $\epsilon - \tau_2$ and so Learn would have returned $h_2$ rather than
going on to compute $h_3$. Thus, $f(0) = e \ge 3\epsilon/4$. Also, using our bound for $e$ and the fact that
$\epsilon = 3\alpha^2 - 2\alpha^3$, we have

\begin{align*}
f(\alpha) &= \alpha + \frac{e - 4\alpha^2(1 - \alpha)}{1 - 2\alpha} \\
  &= \frac{\alpha - 6\alpha^2 + 4\alpha^3 + e}{1 - 2\alpha} \\
  &\ge \frac{\alpha - 6\alpha^2 + 4\alpha^3 + (3/4 + \alpha/2)(3\alpha^2 - 2\alpha^3)}{1 - 2\alpha} \\
  &= \tfrac{1}{4}\alpha(4 - 7\alpha + 2\alpha^2).
\end{align*}

Since $4 - 7\alpha + 2\alpha^2 \ge 1$ for $\alpha \le 1/2$, $f(\alpha) \ge \alpha/4 \ge \epsilon/4$. We conclude $w_{10} + w_{01} \ge \epsilon/4$, completing
the proof. ■
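As a sanity check on this argument, the right-hand side of (3.10) can be evaluated on a grid. The sketch below is illustrative; the function names are ours, and it fixes $e$ at its smallest admissible value $(3/4 + \alpha/2)\epsilon$, which is the worst case since the expression is increasing in $e$.

```python
def g(x: float) -> float:
    """g(x) = 3x^2 - 2x^3."""
    return 3 * x**2 - 2 * x**3


def f(a1: float, e: float, alpha: float) -> float:
    """Right-hand side of (3.10): a lower bound on w01 + w10 (needs a1 < 1/2)."""
    return a1 + (e - 4 * a1 * alpha * (1 - a1)) / (1 - 2 * a1)


def min_disagreement(alpha: float, steps: int = 100) -> float:
    """Minimum of f over a grid on [0, alpha], with e at its smallest admissible value."""
    eps = g(alpha)
    e_min = (0.75 + alpha / 2) * eps
    return min(f(alpha * j / steps, e_min, alpha) for j in range(steps + 1))
```

On such a grid the minimum stays above $\epsilon/4$, consistent with the claim that $\mathit{EX}_3$ loops at most $4/\epsilon$ times in expectation.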

To bound the number of examples needed to estimate $a_1$ and $e$, we make use of the following
bounds on the tails of a binomial distribution [10, 44].

Lemma 3.6 (Chernoff Bounds) Let $X_1, \ldots, X_m$ be a sequence of $m$ independent Bernoulli
trials, each succeeding with probability $p$ so that $E[X_i] = p$. Let $S = X_1 + \cdots + X_m$ be the
random variable describing the total number of successes. Then for $0 \le \gamma \le 1$, the following
hold:

• (additive form) $\Pr[S \ge (p + \gamma)m] \le e^{-2m\gamma^2}$, and $\Pr[S \le (p - \gamma)m] \le e^{-2m\gamma^2}$;

• (multiplicative form) $\Pr[S \ge (1 + \gamma)pm] \le e^{-\gamma^2 mp/3}$, and $\Pr[S \le (1 - \gamma)pm] \le e^{-\gamma^2 mp/2}$.

The additive form (also known as Hoeffding's inequality) holds also if $X_1, \ldots, X_m$ are independent
identically distributed random variables with range in $[0, 1]$.
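The additive form translates directly into a sample size. Requiring both tails together to have probability at most $\delta$ gives $2e^{-2m\gamma^2} \le \delta$, that is, $m \ge \ln(2/\delta)/(2\gamma^2)$. The helper below is a sketch with a name of our choosing.

```python
import math


def additive_sample_size(gamma: float, delta: float) -> int:
    """Smallest m with 2 * exp(-2 * m * gamma^2) <= delta.

    With this many Bernoulli trials, the empirical frequency is within
    gamma of the true probability except with probability at most delta.
    """
    return math.ceil(math.log(2 / delta) / (2 * gamma**2))
```

For instance, `additive_sample_size(0.1, 0.05)` returns 185: estimating a probability to within 0.1 with 95% confidence takes 185 samples under this bound.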

Lemma 3.7 On a good run, the expected number of examples $M(\epsilon, \delta)$ needed by Learn($\epsilon$, $\delta$, EX)
is $O\left( \frac{36^B}{\epsilon^2} \cdot \left( p^2 \log(5^B/\delta) + m(\delta/5^B) \right) \right)$.

Proof: In the base case that $\epsilon \ge 1/2 - 1/p$, Learn simply calls WeakLearn, so we have $M(\epsilon, \delta) =
m(\delta)$. Otherwise, on each of the recursive calls, the simulated oracle is required to provide
$M(g^{-1}(\epsilon), \delta/5)$ examples. To provide one such example, the simulated oracle must itself draw
at most an average of $4/\epsilon$ examples from EX. Thus, each recursive call demands at most
$(4/\epsilon) \cdot M(g^{-1}(\epsilon), \delta/5)$ examples on average.

In addition, Learn requires some examples for making its estimates $\hat{a}_1$ and $\hat{e}$. Using the additive
form of Lemma 3.6, it follows that a sample of size $O(\log(1/\delta)/\tau_i^2)$ suffices for each estimate,
for $i = 1, 2$. Note that $1/p \le 1/2 - \epsilon = 1/2 - g(\alpha) = (1/2 - \alpha)(1 + 2\alpha - 2\alpha^2) \le (3/2)(1/2 - \alpha)$.
Thus, by our choice of $\tau_1$ and $\tau_2$, both estimates can be made using $c p^2 \log(1/\delta)/\epsilon^2$ examples,
for some positive constant $c$.

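We thus arrive at the recurrent inequality:

\[ M(\epsilon, \delta) \le \frac{12}{\epsilon} \cdot M(g^{-1}(\epsilon), \delta/5) + \frac{c p^2 \log(1/\delta)}{\epsilon^2}. \tag{3.11} \]

To complete the proof, we argue inductively that inequality (3.11) implies the bound

\[ M(\epsilon, \delta) \le \frac{36^B \cdot m(\delta/5^B) + c(36^B - 1) p^2 \log(5^B/\delta)}{\epsilon^2}. \tag{3.12} \]

The base case ($B = 0$) clearly satisfies this bound. In the general case, equation (3.11) implies
by inductive hypothesis that

\begin{align*}
M(\epsilon, \delta) &\le \frac{12}{\epsilon} \cdot \left[ \frac{36^{B-1} \cdot m(\delta/5^B) + c(36^{B-1} - 1) p^2 \log(5^B/\delta)}{(g^{-1}(\epsilon))^2} \right] + \frac{c p^2 \log(1/\delta)}{\epsilon^2} \\
  &\le \frac{12}{\epsilon} \cdot \frac{3}{\epsilon} \left[ 36^{B-1} \cdot m(\delta/5^B) + c(36^{B-1} - 1) p^2 \log(5^B/\delta) \right] + \frac{c p^2 \log(1/\delta)}{\epsilon^2} \\
  &= \frac{36^B \cdot m(\delta/5^B) + c(36^B - 1) p^2 \log(5^B/\delta)}{\epsilon^2} + \frac{c p^2}{\epsilon^2} \left( \log(1/\delta) - 35 \log(5^B/\delta) \right)
\end{align*}

which clearly implies (3.12). The last inequality here follows from the fact that $\epsilon \le 3(g^{-1}(\epsilon))^2$
since $g(\alpha) \le 3\alpha^2$ for $\alpha \ge 0$. ■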

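Lemma 3.8 On a good run, the expected execution time of Learn($\epsilon$, $\delta$, EX) is given by $T(\epsilon, \delta) =
O\left( 3^B \cdot t(\delta/5^B) + \frac{108^B \cdot u(\delta/5^B)}{\epsilon^2} \cdot \left( p^2 \log(5^B/\delta) + m(\delta/5^B) \right) \right)$.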

Proof: As in the previous lemmas, the base case that $\epsilon \ge 1/2 - 1/p$ is easily handled. In this case,
$T(\epsilon, \delta) = t(\delta)$.

Otherwise, Learn takes expected time $3 \cdot T(g^{-1}(\epsilon), \delta/5)$ on its three recursive calls. In
addition, Learn spends time drawing examples to make the estimates $\hat{a}_1$ and $\hat{e}$, and overhead
time is also spent by the simulated examples oracles passed on the three recursive calls. A typical
example that is drawn from Learn's oracle EX is evaluated on zero, one or two of the previously
computed subhypotheses. For instance, an example drawn for the purpose of estimating $a_1$ is
evaluated once by $h_1$; an example drawn for the simulated oracle $\mathit{EX}_3$ is evaluated by both $h_1$
and $h_2$. Thus, Learn's overhead time is proportional to the product of the total number of
examples needed by Learn and the time it takes to evaluate a subhypothesis on one of these
examples. Therefore, the following recurrence holds:

\[ T(\epsilon, \delta) \le 3 \cdot T(g^{-1}(\epsilon), \delta/5) + c \cdot U(g^{-1}(\epsilon), \delta/5) \cdot M(\epsilon, \delta) \tag{3.13} \]

for some positive constant $c$. Applying Lemmas 3.4 and 3.7, this implies

\[ T(\epsilon, \delta) \le 3 \cdot T(g^{-1}(\epsilon), \delta/5) + \frac{c' \cdot 108^B \cdot u(\delta/5^B)}{\epsilon^2} \cdot \left( p^2 \log(5^B/\delta) + m(\delta/5^B) \right) \tag{3.14} \]




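for some positive constant $c'$. An induction argument shows that this implies:

\[ T(\epsilon, \delta) \le 3^B \cdot t(\delta/5^B) + \frac{2c' \cdot 108^B \cdot u(\delta/5^B)}{\epsilon^2} \cdot \left( p^2 \log(5^B/\delta) + m(\delta/5^B) \right). \tag{3.15} \]

Clearly, the base case ($B = 0$) satisfies this inequality. In general, equation (3.14) implies using
our inductive hypothesis that

\[ T(\epsilon, \delta) \le 3^B \cdot t(\delta/5^B) + \left[ \frac{3 \cdot 2c' \cdot 108^{B-1} \cdot u(\delta/5^B)}{(g^{-1}(\epsilon))^2} + \frac{c' \cdot 108^B \cdot u(\delta/5^B)}{\epsilon^2} \right] \cdot \left( p^2 \log(5^B/\delta) + m(\delta/5^B) \right). \]

Since $g^{-1}(\epsilon) \ge \epsilon$, this clearly implies equation (3.15), completing the induction. ■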

The  main  result  of  this  section  follows  immediately: 


Theorem 3.9 Let $0 < \epsilon < 1/2$ and let $0 < \delta \le 1$. With probability at least $1 - \delta$, the execution
of Learn($\epsilon$, $\delta/2$, EX) halts in time polynomial in $1/\epsilon$, $1/\delta$, $n$ and $s$, and outputs a hypothesis
$\epsilon$-close to the target concept.

Proof: By Theorem 3.2, the chance that Learn($\epsilon$, $\delta/2$, EX) does not have a good run is at
most $\delta/2$. By Markov's inequality, the chance that Learn($\epsilon$, $\delta/2$, EX) on a good run fails to
halt in time $(2/\delta) \cdot T(\epsilon, \delta/2)$ is also at most $\delta/2$. Thus, using Lemma 3.8, the probability that
Learn($\epsilon$, $\delta/2$, EX) has a good run (and so outputs an $\epsilon$-close hypothesis) and halts in polynomial
time is at least $1 - \delta$. ■


2-3.5  Space  complexity 

Although  not  of  immediate  consequence  to  the  proof  of  Theorem  3.9,  it  is  worth  pointing  out 
that  Learn’s  space  requirements  are  relatively  modest,  as  proved  in  this  section. 

Let $S(\epsilon, \delta)$ be the space used by Learn($\epsilon$, $\delta$, EX); let $Q(\epsilon, \delta)$ be the space needed to store an
output hypothesis; and let $R(\epsilon, \delta)$ be the space needed to evaluate such a hypothesis. Let $s(\delta)$,
$q(\delta)$ and $r(\delta)$ be analogous quantities for WeakLearn($\delta$, EX). (Each of these measures worst-case
space complexity.) Then we have:

Lemma 3.10 The space $Q(\epsilon, \delta)$ required to store a hypothesis output by Learn($\epsilon$, $\delta$, EX) is at
most $O(3^B \cdot q(\delta/5^B))$. The space $R(\epsilon, \delta)$ needed to evaluate such a hypothesis is $O(B + r(\delta/5^B))$.
Finally, the total space $S(\epsilon, \delta)$ required by Learn is $O(3^B \cdot q(\delta/5^B) + s(\delta/5^B) + B \cdot r(\delta/5^B))$.

Proof: For $\epsilon \ge 1/2 - 1/p$, the bounds are trivial. To bound $Q$, note that the hypothesis returned
by Learn is a composite of three (or fewer) subhypotheses. Thus,

\[ Q(\epsilon, \delta) \le 3 \cdot Q(g^{-1}(\epsilon), \delta/5) + O(1). \]




To evaluate such a composite hypothesis, each of the subhypotheses is evaluated one at a time.
Thus,

\[ R(\epsilon, \delta) \le R(g^{-1}(\epsilon), \delta/5) + O(1). \]

Finally, to bound $S$, note that the space required by Learn is dominated by the storage of
the subhypotheses, by their recursive computation, and by the space needed to evaluate them.
Since the subhypotheses are computed one at a time, we have:

\[ S(\epsilon, \delta) \le S(g^{-1}(\epsilon), \delta/5) + O\left( Q(g^{-1}(\epsilon), \delta/5) + R(g^{-1}(\epsilon), \delta/5) \right). \]

The solutions of these three recurrences are all straightforward, and imply the stated bounds.


2-4  Improving  Learn’s  time  and  sample  complexity 

In this section, we describe a modification to the construction of Section 2-3 that significantly
improves Learn's time and sample complexity. In particular, we improve these complexity
measures by roughly a factor of $1/\epsilon$, giving bounds that are linear in $1/\epsilon$ (ignoring log factors).
These improved bounds will have some interesting consequences, described in later sections.

In the original construction of Learn given in Figure 2, much time and many examples
are squandered by the simulated oracles $\mathit{EX}_i$ waiting for a desirable instance to be drawn.
Lemma 3.5 showed that the expected time spent waiting is $O(1/\epsilon)$. The modification described
below will reduce this to $O(1/\alpha) = O(1/\sqrt{\epsilon})$. (Here, $\alpha = g^{-1}(\epsilon)$ as before.)

Recall that the running time of oracle $\mathit{EX}_2$ depends on the error $a_1$ of the first subhypothesis
$h_1$. In the original construction, we ensured that $a_1$ not be too small by estimating its
value, and, if smaller than $\epsilon$, returning $h_1$ instead of continuing the normal execution of the
subroutine. Since this approach only guarantees that $a_1 \ge \Omega(\epsilon)$, there does not seem to be
any way of ensuring that $\mathit{EX}_2$ run for $o(1/\epsilon)$ time. To improve $\mathit{EX}_2$'s running time then, we
will instead modify $h_1$ by deliberately increasing its error. Ironically, this intentional injection
of error will have the effect of improving Learn's worst-case running time by limiting the time
spent by either $\mathit{EX}_2$ or $\mathit{EX}_3$ waiting for a suitable instance.

2-4.1  The  modifications 

Specifically, here is how Learn is modified. Call the new procedure Learn'. After the recursive
computation of $h_1$, Learn' estimates the error $a_1$ of $h_1$, although less accurately than Learn.
Let $\hat{a}_1$ be this estimate, and choose a sample large enough that $|a_1 - \hat{a}_1| \le \alpha/4$ with probability
at least $1 - \delta/5$. Since, on a good run, $a_1 \le \alpha$, we can assume without loss of generality that
$\hat{a}_1 \le 3\alpha/4$. (For if $\hat{a}_1 > 3\alpha/4$, then $a_1 \ge \alpha/2$ (assuming $\hat{a}_1$ has the desired accuracy); thus, in
this case, $|a_1 - 3\alpha/4| \le \alpha/4$ and $\hat{a}_1$ can be replaced by $3\alpha/4$.)

Next, Learn' defines a new hypothesis $h_1'$ as follows: given an instance $x$, $h_1'$ first flips a coin
biased to turn up "heads" with probability exactly

\[ \beta = \frac{\frac{3}{4}\alpha - \hat{a}_1}{1 - \frac{1}{4}\alpha - \hat{a}_1}. \]

If the outcome of this coin flip is "tails," then $h_1'$ evaluates $h_1(x)$ and returns the result. Otherwise,
if "heads," $h_1'$ predicts the wrong answer, $\neg c(x)$. Since $h_1'$ will only be used during the
training phase, we can assume that the correct classification of $x$ is available, and thus that $h_1'$
can be simulated. Also, note that $0 \le \beta < 1$ since $\hat{a}_1 \le 3\alpha/4$.

This new hypothesis $h_1'$ is now used in place of $h_1$ by $\mathit{EX}_2$ and $\mathit{EX}_3$. The rest of the procedure
Learn is unmodified. In particular, the final returned hypothesis $h$ is unchanged; that is, $h_1$,
not $h_1'$, is used by $h$.

One other modification is made to improve Learn's time and sample complexity: after $h_2$ is
computed, its error $e$ with respect to $D$ is estimated in a slightly different manner. Specifically,
we make an estimate $\hat{e}$ with the following properties:

• if $e \le \epsilon - 2\tau_2$, then $\hat{e} \le \epsilon - \tau_2$ with probability at least $1 - \delta/5$; and

• if $e \ge \epsilon - 2\tau_2$, then $\hat{e} \ge (1 - \tau_2/\epsilon)e$ with probability at least $1 - \delta/5$.

We will see that such an estimate has all of the needed properties, but requires a significantly
smaller sample.


2-4.2  Correctness 

To see that Learn' is correct, we will assume as in the proof of Theorem 3.2 that a good run
occurs; this will be the case with probability at least $1 - \delta$. If $\hat{e} \le \epsilon - \tau_2$ so that $h = h_2$ is
returned, then either $e \le \epsilon - 2\tau_2$, or $e$ is such that $e \le \hat{e} \cdot \epsilon/(\epsilon - \tau_2)$. In either case, the returned
hypothesis $h_2$ clearly has error $e \le \epsilon$.

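Otherwise, note that the error of $h_1'$ is exactly $a_1' = (1 - \beta)a_1 + \beta$ since the chance of error
is $a_1$ on "tails," and is 1 on "heads." By our choice of $\beta$, we have that:

\begin{align*}
a_1' &\le (1 - \beta)(\hat{a}_1 + \tfrac{1}{4}\alpha) + \beta \\
     &= \hat{a}_1 + \tfrac{1}{4}\alpha + (1 - \hat{a}_1 - \tfrac{1}{4}\alpha)\beta \\
     &= \hat{a}_1 + \tfrac{1}{4}\alpha + \tfrac{3}{4}\alpha - \hat{a}_1 = \alpha,
\end{align*}

and also

\begin{align*}
a_1' &\ge (1 - \beta)(\hat{a}_1 - \tfrac{1}{4}\alpha) + \beta \\
     &= \hat{a}_1 - \tfrac{1}{4}\alpha + (1 - \hat{a}_1 + \tfrac{1}{4}\alpha)\beta \\
     &\ge \hat{a}_1 - \tfrac{1}{4}\alpha + (1 - \hat{a}_1 - \tfrac{1}{4}\alpha)\beta \\
     &= \hat{a}_1 - \tfrac{1}{4}\alpha + \tfrac{3}{4}\alpha - \hat{a}_1 = \tfrac{1}{2}\alpha.
\end{align*}

Thus, $\alpha/2 \le a_1' \le \alpha$.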
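The effect of the biased coin can be checked numerically. The sketch below is illustrative and uses names of our choosing; it takes the coin bias to be $(3\alpha/4 - \hat{a}_1)/(1 - \alpha/4 - \hat{a}_1)$, the value consistent with the two derivations above, and it assumes the hypothesis interface (an instance maps to a Boolean label, with the true label available during training).

```python
import random
from typing import Callable


def coin_bias(alpha: float, a1_hat: float) -> float:
    """Probability of "heads" for the modified hypothesis.

    The estimate is first capped at 3 * alpha / 4, as the text permits
    without loss of generality.
    """
    a1_hat = min(a1_hat, 0.75 * alpha)
    return (0.75 * alpha - a1_hat) / (1 - 0.25 * alpha - a1_hat)


def inflated_error(a1: float, beta: float) -> float:
    """Error of the modified hypothesis: a1 on "tails", 1 on "heads"."""
    return (1 - beta) * a1 + beta


def make_h1_prime(h1: Callable[[int], bool], beta: float,
                  rng: random.Random) -> Callable[[int, bool], bool]:
    """Training-time hypothesis that errs deliberately with probability beta."""
    def h1_prime(x: int, label: bool) -> bool:
        if rng.random() < beta:   # "heads": predict the wrong answer
            return not label
        return h1(x)              # "tails": defer to h1
    return h1_prime
```

Whenever the estimate is within $\alpha/4$ of the true error, `inflated_error` lands in $[\alpha/2, \alpha]$, which is what keeps the waiting time of the second oracle at $O(1/\alpha)$.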

Let $h'$ be the same hypothesis as $h$, except with $h_1'$ used in lieu of $h_1$. Note that $h'$, $h_1'$, $h_2$
and $h_3$ are related to one another in exactly the same way that $h$, $h_1$, $h_2$ and $h_3$ are related
in the original proof of Theorem 3.2. That is, if we imagine that $h_1'$ is returned on the first
recursive call of the original procedure Learn, then it is not impossible that $h_2$ and $h_3$ would
be returned on the second and third recursive calls, in which case $h'$ would be the returned
hypothesis. Put another way, the proof that $h'$ has error at most $g(\alpha) = \epsilon$ is an identical copy of
the one given in the proof of Theorem 3.2, except that all occurrences of $h$ and $h_1$ are replaced
by $h'$ and $h_1'$.

Finally, we must show that $h$'s error is at most that of $h'$. Let $p_1'(x) = \Pr[h_1'(x) \ne c(x)]$,
and let $p_i(x)$ be as in Theorem 3.2. Then for $x \in X_n$, we have

\begin{align*}
\Pr[h'(x) \ne c(x)] &= p_1'(x)[(1 - p_2(x))p_3(x) + p_2(x)(1 - p_3(x))] + p_2(x)p_3(x) \\
  &\ge p_1(x)[(1 - p_2(x))p_3(x) + p_2(x)(1 - p_3(x))] + p_2(x)p_3(x) \\
  &= \Pr[h(x) \ne c(x)]
\end{align*}

where the inequality follows from the observation that $p_1'(x) = (1 - \beta)p_1(x) + \beta \ge p_1(x)$. This
implies that the error of $h$ is at most the error of $h'$, which is bounded by $\epsilon$.
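The pointwise error formula used here is the probability that at least two of the three subhypotheses err on $x$. The sketch below checks it against a brute-force enumeration; it assumes, as the formula itself does, that the three hypotheses randomize independently on a fixed instance, and the function names are ours.

```python
from itertools import product


def majority_error(p1: float, p2: float, p3: float) -> float:
    """Closed form used in the text for the chance that the combined hypothesis errs."""
    return p1 * ((1 - p2) * p3 + p2 * (1 - p3)) + p2 * p3


def majority_error_bruteforce(p1: float, p2: float, p3: float) -> float:
    """Sum the probabilities of all outcomes in which at least two hypotheses err."""
    total = 0.0
    for errs in product([0, 1], repeat=3):
        if sum(errs) >= 2:
            prob = 1.0
            for p, err in zip((p1, p2, p3), errs):
                prob *= p if err else 1 - p
            total += prob
    return total
```

Because the closed form is nondecreasing in its first argument, replacing $p_1$ by the larger $p_1'$ can only increase the error, which is the inequality used in the text.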

2-4.3  Analysis 

Next,  we  show  that  Learn'  runs  faster  using  fewer  examples  than  Learn.  We  use  essentially 
the  same  analysis  as  in  Section  2-3.4.  The  following  three  lemmas  are  modified  versions  of 
Lemmas  3.5,  3.7  and  3.8.  The  proofs  of  the  other  lemmas  apply  immediately  to  Learn'  with 
little  or  no  modification,  and  so  are  omitted. 

Lemma 4.1 Let $r$ be the expected number of examples drawn from EX by any oracle $\mathit{EX}_i$
simulated by Learn' on a good run when asked to provide a single example. Then $r \le 4/\alpha$.

Proof: As in the original proof, $r = 1$ for $\mathit{EX}_1$. We expect the second oracle $\mathit{EX}_2$ to loop at
most $1/a_1'$ times on average. Since, as noted above, $\alpha/2 \le a_1' \le \alpha$, $r$ is at most $2/\alpha$ in this case.

Finally, to bound the number of iterations of $\mathit{EX}_3$, we will show that $w_{10} + w_{01} \ge \alpha/4$ using
equation (3.10) as in the original proof. To lower bound $w_{10} + w_{01}$, we find the minimum of
the last formula $f$ of (3.10) (with $a_1$ replaced by $a_1'$ of course) on the interval $[\alpha/2, \alpha]$. As
noted previously, the function $f$ must achieve its minimum at one endpoint of the interval. We
assume as in the original proof that $\hat{e} \ge \epsilon - \tau_2$ and thus that $e \ge \epsilon - 2\tau_2 = (3/4 + \alpha/2)\epsilon$. It
was previously shown that $f(\alpha) \ge \alpha/4$, and, by a similar argument, we can bound $f(\alpha/2)$.
Specifically, since $e \ge (3/4 + \alpha/2)(3\alpha^2 - 2\alpha^3)$, we have that

\begin{align*}
f(\alpha/2) &= \frac{\alpha}{2} + \frac{e - \alpha^2(2 - \alpha)}{1 - \alpha} \\
  &\ge \frac{\alpha}{2} + \alpha^3 + \frac{\alpha^2}{4(1 - \alpha)}.
\end{align*}

Every term on the right is nonnegative, so $f(\alpha/2) \ge \alpha/2 \ge \alpha/4$. This completes the proof.


Lemma 4.2 On a good run, the expected number of examples $M(\epsilon, \delta)$ needed by Learn'($\epsilon$, $\delta$, EX)
is $O\left( \frac{36^B}{\epsilon} \cdot \left( p^2 \log(5^B/\delta) + m(\delta/5^B) \right) \right)$.

Proof: The proof is nearly the same as for Lemma 3.7. In addition to incorporating the superior
bound given by Lemma 4.1 on the number of examples needed by the simulated oracles, we
must also consider the number of examples needed to estimate $a_1$ and $e$. The first, $a_1$, can be
estimated using a sample of size $O(\log(1/\delta)/\alpha^2) = O(\log(1/\delta)/\epsilon)$; this can be derived using the
additive form of Lemma 3.6, and by noting that $\epsilon = g(\alpha) \le 3\alpha^2$ for $\alpha \ge 0$.

Using the multiplicative form of Chernoff bounds, we argue that $e$ can be estimated with
the desired accuracy using a sample of size only $O(p^2 \log(1/\delta)/\epsilon)$. First, if $e \le \epsilon - 2\tau_2$, then the
chance that $\hat{e}$ exceeds $\epsilon - \tau_2$ when derived from a sample of size $m$ is at most the probability
of more than $(\epsilon - \tau_2)m$ successes occurring in a sequence of $m$ Bernoulli trials, each succeeding
with probability exactly $\epsilon - 2\tau_2$. Applying the multiplicative form of Lemma 3.6 (with $\gamma$ set to
$\tau_2/(\epsilon - 2\tau_2)$), it follows that this probability is at most $\delta/5$ for $m$ proportionate to $p^2 \log(1/\delta)/\epsilon$.
Thus, for such a choice of $m$, if $e \le \epsilon - 2\tau_2$, then $\hat{e} \le \epsilon - \tau_2$ with probability at least $1 - \delta/5$.

If $e \ge \epsilon - 2\tau_2$ then, again applying the multiplicative form of Chernoff bounds (with $\gamma$ set
to $\tau_2/\epsilon$), we see that $\hat{e} \ge (1 - \tau_2/\epsilon)e$ with probability at least $1 - \delta/5$ for a sample of size
proportionate to $p^2 \log(1/\delta)/\epsilon$.

Thus, we arrive at the recurrence

$$ M(\epsilon, \delta) \le \frac{12}{g^{-1}(\epsilon)} \cdot M(g^{-1}(\epsilon), \delta/5) + \frac{c p^2 \log(1/\delta)}{\epsilon} $$

for some positive constant $c$. This implies the stated bound by an argument similar to that
given in the proof of Lemma 3.7. ■

Lemma 4.3  On a good run, the expected execution time of Learn'$(\epsilon, \delta, EX)$ is given by
$T(\epsilon, \delta) = O\left( 3^B \cdot t(\delta/5^B) + [\ldots] \cdot \left( p^2 \log(5^B/\delta) + m(\delta/5^B) \right) \right)$.

Proof:  This  bound  follows  from  the  recurrence  (3.13),  using  the  superior  bound  on  M  given 
by  Lemma  4.2.  ■ 

2-5  Variations  on  the  learning  model 

Next,  we  consider  how  the  main  result  relates  to  some  other  learning  models. 

2-5.1  Group  learning 

An immediate consequence of Theorem 3.1 concerns group learnability. In the group learning
model,  the  learner  produces  a  hypothesis  that  need  only  correctly  classify  large  groups  of 
instances,  all  of  which  are  either  positive  or  negative  examples.  Kearns  and  Valiant  [49,  52] 
prove  the  equivalence  of  group  learning  and  weak  learning.  Thus,  by  Theorem  3.1,  group 
learning  is  also  equivalent  to  strong  learning. 

2-5.2  Miscellaneous  PAC  models 

Haussler  et  al.  [38]  describe  numerous  variations  on  the  basic  PAC  model,  and  show  that  all  of 
them  are  equivalent.  For  instance,  they  consider  randomized  versus  deterministic  algorithms, 
algorithms  for  which  the  size  s  of  the  target  concept  is  known  or  unknown,  and  so  on.  It  is  not 
hard  to  see  that  all  of  their  equivalence  proofs  apply  to  weak  learning  algorithms  as  well  (with 
one  exception  described  below),  and  so  that  any  of  these  weak  learning  models  are  equivalent 
by  Theorem  3.1  to  the  basic  PAC-learning  model. 

The  one  reduction  from  their  paper  that  does  not  hold  for  weak  learning  algorithms  concerns 
the  equivalence  of  the  one-  and  two-oracle  learning  models.  In  the  one-oracle  model  (used 
exclusively  in  this  chapter),  the  learner  has  access  to  a  single  source  of  positive  and  negative 
examples.  In  the  two-oracle  model,  the  learner  has  access  to  one  oracle  that  returns  only 
positive  examples,  and  another  returning  only  negative  examples.  The  authors  show  that  these 
models  are  equivalent  for  strong  learning  algorithms.  However,  their  proof  apparently  cannot  be 
adapted to show that one-oracle weak learnability implies two-oracle weak learnability (although
their proof of the converse is easily and validly adapted). This is because their proof assumes
that the error $\epsilon$ can be made arbitrarily small, clearly a bad assumption for weak learning
algorithms. Nevertheless, this is not a problem since we have shown that one-oracle weak
learnability implies one-oracle strong learnability, which in turn implies two-oracle strong (and
therefore weak) learnability. Thus, despite the inapplicability of Haussler et al.’s original proof,
all four learning models are equivalent.

2-5.3  Fixed  hypothesis  spaces 

Much  of  the  PAC-learning  research  has  been  concerned  with  the  form  or  representation  of  the 
hypotheses  output  by  the  learning  algorithm.  Clearly,  the  construction  described  in  Section  2-3 
does  not  in  general  preserve  the  form  of  the  hypotheses  used  by  the  weak  learning  algorithm.  It 
is  natural  to  ask  whether  there  exists  any  construction  preserving  this  form.  That  is,  if  concept 
class  C  is  weakly  learnable  by  an  algorithm  using  hypotheses  from  a  class  H  of  representations, 
does there then exist a strong learning algorithm for C that also outputs only hypotheses
from H?

In general, the answer to this question is “no” (modulo some relatively weak complexity
assumptions). As a simple example, consider the problem of learning $k$-term DNF formulas
using only hypotheses represented by $k$-term DNF. (A formula in disjunctive normal form
(DNF) is one written as a disjunction of terms, each of which is a conjunction of literals, a
literal being either a variable or its complement.) Pitt and Valiant [68] show that this learning
problem is infeasible if RP $\ne$ NP for $k$ as small as 2.

Nevertheless,  the  weak  learning  problem  is  solved  by  the  algorithm  sketched  below.  (A 
similar  algorithm  is  given  by  Kearns  [48].)  First,  choose  a  “large”  sample.  If  significantly  more 
than  half  of  the  examples  in  the  sample  are  negative  (positive),  then  output  the  “always  predict 
negative  (positive)”  hypothesis,  and  halt.  Otherwise,  we  can  assume  that  the  distribution  is 
roughly  evenly  split  between  positive  and  negative  examples.  Select  and  output  the  disjunction 
of  k  or  fewer  literals  that  misclassifies  none  of  the  positive  examples,  and  the  fewest  of  the 
negative  examples. 

We briefly argue that this hypothesis is, with high probability, $(\frac{1}{2} - \Omega(1/n^k))$-close to the
target concept. First, note that the target $k$-term DNF formula is equivalent to some $k$-CNF
formula [68]. (A formula in conjunctive normal form (CNF) is one written as the conjunction
of clauses, each clause a disjunction of literals. If each clause consists of only $k$ literals, then the
formula is in $k$-CNF.) Next, we observe that every clause is satisfied by every assignment that
satisfies the entire $k$-CNF formula. Moreover, since the formula has $O(n^k)$ clauses, by
an averaging argument, there must be one clause not satisfied by an $\Omega(1/n^k)$ fraction of the assignments (as
weighted by the target distribution) that do not satisfy the entire formula. Thus, there exists
some disjunction of $k$ literals that is correct for nearly all of the positive examples and for at
least an $\Omega(1/n^k)$ fraction of the negative examples. In particular, the output hypothesis has this property.
Since the distribution is roughly evenly divided between positive and negative examples, this
implies that the output hypothesis is roughly $(\frac{1}{2} - \Omega(1/n^k))$-close to the target formula.
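
The weak learning procedure sketched above can be illustrated in a few lines of Python. This is only a minimal sketch under stated assumptions: the names `all_literals`, `eval_disjunction`, `weak_learn` and the `margin` threshold are hypothetical choices made here, the search is by brute force over all disjunctions of at most $k$ literals, and no attempt is made to choose the sample size.

```python
from itertools import combinations

def all_literals(n):
    # (i, True) stands for the variable x_i; (i, False) for its complement.
    return [(i, pol) for i in range(n) for pol in (True, False)]

def eval_disjunction(lits, x):
    # True if at least one literal is satisfied by the assignment x.
    return any(x[i] == pol for i, pol in lits)

def weak_learn(sample, n, k, margin=0.1):
    """sample: list of (x, label), x a tuple of n booleans.
    margin is a hypothetical threshold for 'significantly more than half'."""
    m = len(sample)
    pos = [x for x, y in sample if y]
    neg = [x for x, y in sample if not y]
    # Skewed sample: a constant hypothesis already beats random guessing.
    if len(neg) > (0.5 + margin) * m:
        return lambda x: False
    if len(pos) > (0.5 + margin) * m:
        return lambda x: True
    best, best_errors = None, None
    for size in range(1, k + 1):
        for lits in combinations(all_literals(n), size):
            # Keep only disjunctions that misclassify no positive example.
            if all(eval_disjunction(lits, x) for x in pos):
                # A negative example is misclassified when the disjunction accepts it.
                errors = sum(eval_disjunction(lits, x) for x in neg)
                if best_errors is None or errors < best_errors:
                    best, best_errors = lits, errors
    if best is None:
        return lambda x: True
    return lambda x, lits=best: eval_disjunction(lits, x)
```

For the 2-term DNF target $(x_0 \wedge x_1) \vee (x_2 \wedge x_3)$ under the uniform distribution on $\{0,1\}^4$, the selected hypothesis is one clause of the equivalent 2-CNF; it accepts every positive assignment and rejects only some of the negative ones, which is all that weak learning asks for.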

2-5.4  Queries

A  number  of  researchers  have  considered  learning  scenarios  in  which  the  learner  is  not  only  able 
to  passively  observe  randomly  selected  examples,  but  is  also  able  to  ask  a  “teacher”  various  sorts 
of  questions  or  queries  about  the  target  concept.  For  instance,  the  learner  might  be  allowed  to 
ask  if  some  particular  instance  is  a  positive  or  negative  example.  Angluin  [7]  describes  several 
kinds  of  queries  that  might  be  useful  to  the  learner.  The  purpose  of  this  section  is  simply  to 
point  out  that  the  construction  of  Section  2-3  is  applicable  even  in  the  presence  of  most  kinds 
of  queries.  That  is,  a  weak  learning  algorithm  that  depends  on  the  availability  of  certain  kinds 
of  queries  can  be  converted,  using  the  same  construction,  into  a  strong  learning  algorithm  using 
the  same  query  types. 

2-5.5  Many-valued  concepts 

In  this  chapter,  we  have  only  considered  Boolean-valued  concepts,  i.e.,  concepts  that  classify 
every  instance  as  either  a  positive  or  a  negative  example.  Of  course,  in  the  “real  world,” 
many  learning  tasks  require  classification  into  one  of  several  categories  (for  instance,  character 
recognition). How does the result generalize to handle many-valued concepts?

First of all, for learning a $k$-valued concept, it is not immediately clear how to define
the notion of weak learnability. A hypothesis that guesses randomly on every instance will be
correct only $1/k$ of the time, so one natural definition would require only that the weak learning
algorithm classify instances correctly slightly more than $1/k$ of the time. Unfortunately, under
this definition, strong and weak learnability are inequivalent for $k$ as small as three. As an
informal example, consider learning a concept taking the values 0, 1 and 2, and suppose that
it is “easy” to predict when the concept has the value 2, but “hard” to predict whether the
concept’s value is 0 or 1. Then to weakly learn such a concept, it suffices to find a hypothesis that
is correct whenever the concept is 2, and that guesses randomly between 0 and 1 otherwise. For any distribution,
this hypothesis will be correct at least half of the time, achieving the weak learning criterion of accuracy
significantly better than 1/3. However, boosting the accuracy further is clearly infeasible.
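
To make the accuracy claim concrete, write $p_2$ (a symbol introduced here purely for illustration) for the probability that the concept takes the value 2 under the given distribution. Assuming the hypothesis guesses uniformly between 0 and 1 on all remaining instances, its accuracy works out as:

```latex
\Pr[h(x) = c(x)] \;=\; p_2 \cdot 1 + (1 - p_2) \cdot \tfrac{1}{2}
  \;=\; \tfrac{1}{2} + \tfrac{p_2}{2} \;\ge\; \tfrac{1}{2} \;>\; \tfrac{1}{3}.
```

The bound of $1/2$ is attained when $p_2 = 0$, so this hypothesis clears the $1/3$ threshold without carrying any information about the hard part of the concept.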

Thus, a better definition of weak learnability is one requiring that the hypothesis be correct
on slightly more than half of the distribution, regardless of $k$. Using this definition, the
construction  of  Section  2-3  is  easily  modified  to  handle  many-valued  concepts. 


2-6  General complexity bounds for PAC learning

The  construction  derived  in  Sections  2-3  and  2-4  yields  some  unexpected  relationships  between 
the allowed error $\epsilon$ and various complexity measures that might be applied to a strong learning
algorithm. One of the more surprising of these is a proof that, for every learnable concept class,
there exists an efficient algorithm whose output hypotheses can be evaluated in time polynomial
in $\log(1/\epsilon)$. Furthermore, such an algorithm's space requirements are also only poly-logarithmic
in $1/\epsilon$, far less, for instance, than would be needed to store the entire sample. In addition, its
time and sample size requirements grow only linearly in $1/\epsilon$ (disregarding log factors).

Theorem 6.1  If C is a learnable concept class, then there exists an efficient learning algorithm
for C that:

•  requires a sample of size $\frac{1}{\epsilon} \cdot p_1(n, s, \log(1/\epsilon), \log(1/\delta))$,

•  halts in time $\frac{1}{\epsilon} \cdot p_2(n, s, \log(1/\epsilon), \log(1/\delta))$,

•  uses space $p_3(n, s, \log(1/\epsilon), \log(1/\delta))$, and

•  outputs hypotheses of size $p_4(n, s, \log(1/\epsilon))$, evaluatable in time $p_5(n, s, \log(1/\epsilon))$,

for some polynomials $p_1$, $p_2$, $p_3$, $p_4$ and $p_5$.

Proof: Given a strong learning algorithm A for C, “hard-wire” $\epsilon = 1/4$, thus converting A into
a weak learning algorithm A' that outputs hypotheses 1/4-close to the target concept. Now let
A'' be the procedure obtained by applying the construction of Learn' with A' plugged in for
WeakLearn. As remarked previously, we can assume without loss of generality that A'' halts
deterministically in polynomial time. Note, by the lemmas of Sections 2-3 and 2-4, that A''
“almost” achieves the resource bounds given in the theorem, the only problem being that the
bounds attained are polynomial in $1/\delta$ rather than $\log(1/\delta)$ as desired.

This problem is alleviated by applying the construction of Haussler et al. [38] for converting
any learning algorithm B into one running in time polynomial in $\log(1/\delta)$. Essentially, this
construction works as follows: Given inputs $n$, $s$, $\epsilon$ and $\delta$, first simulate B $O(\log(1/\delta))$ times,
each time setting B's accuracy parameter to $\epsilon/4$ and B's confidence parameter to 1/2. Save
all of the computed hypotheses. Next, draw a sample of $O(\log(1/\delta)/\epsilon)$ examples, and output
the saved hypothesis that misclassifies the fewest examples in the sample. Haussler et al. argue that the
resulting procedure outputs an $\epsilon$-close hypothesis with probability $1 - \delta$.
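
The confidence-boosting step just described can be sketched as follows. This is an illustration only, not the construction's exact form: the function name `boost_confidence`, the callables `learner` and `draw_example`, and the constant `c` hidden by the $O(\cdot)$ notation are all assumptions made here.

```python
import math

def boost_confidence(learner, draw_example, eps, delta, c=8):
    """learner(eps, delta) returns a hypothesis (a callable x -> bool);
    draw_example() returns one labeled example (x, label).
    The constant c and the use of log base 2 for the run count are
    hypothetical stand-ins for the constants hidden by the O-notation."""
    # Run the learner O(log(1/delta)) times with confidence parameter 1/2.
    runs = max(1, math.ceil(math.log2(1.0 / delta)))
    candidates = [learner(eps / 4.0, 0.5) for _ in range(runs)]
    # Draw a fresh test sample of O(log(1/delta)/eps) examples.
    m = max(1, math.ceil(c * math.log(1.0 / delta) / eps))
    test_set = [draw_example() for _ in range(m)]
    # Output the saved hypothesis with the fewest mistakes on the test sample.
    def mistakes(h):
        return sum(h(x) != y for x, y in test_set)
    return min(candidates, key=mistakes)
```

Only the candidate hypotheses and a running mistake count per candidate need to be kept, so a streaming variant of the selection step would not have to store the test sample at all; the list above is kept for readability.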

Applying this construction to A'', we obtain a final procedure that one can verify achieves
all  of  the  stated  bounds.  ■ 

The  remainder  of  this  section  is  a  discussion  of  some  of  the  consequences  of  Theorem  6.1. 

2-6.1  Improving  the  performance  of  existing  algorithms 

These bounds can be applied immediately to a number of existing learning algorithms, yielding
improvements in time and/or space complexity (at least in terms of $\epsilon$). For instance, the
computation time of Blumer et al.’s algorithm [14] for learning half-spaces of $\mathbb{R}^n$, which involves the
solution of a linear programming problem of size proportional to the sample, can be improved
by a polynomial factor of $1/\epsilon$. The same is also true of Baum’s [11] algorithm for learning unions
of half-spaces, which involves finding the convex hull of a significant fraction of the sample.

There  are  many  more  algorithms  for  which  the  theorem  implies  improved  space  efficiency. 
This  is  especially  true  of  the  many  known  PAC  algorithms  that  work  by  choosing  a  large 
sample  and  then  finding  a  hypothesis  consistent  with  it.  For  instance,  this  is  how  Rivest’s  [72] 
decision  list  algorithm  works,  as  do  most  of  the  algorithms  described  by  Blumer  et  al.  [14], 
as  well  as  Helmbold,  Sloan  and  Warmuth’s  [43]  construction  for  learning  nested  differences  of 
learnable  concepts.  Since  the  entire  sample  must  be  stored,  these  algorithms  are  not  terribly 
space  efficient,  and  so  can  be  dramatically  improved  by  applying  Theorem  6.1.  Of  course, 
these  improvements  typically  come  at  the  cost  of  requiring  a  somewhat  larger  sample  (by  a 
polynomial factor of $\log(1/\epsilon)$). Thus, there appears to be a trade-off between sample size and
space  (or  time)  complexity. 

2-6.2  Data  compression 

Blumer  et  al.  [13,  14]  have  considered  the  relationship  between  learning  and  data  compression. 
They  have  shown  that,  if  any  sample  can  be  “compressed” — i.e.,  represented  by  a  prediction 
rule  significantly  smaller  than  the  original  sample — then  this  compression  algorithm  can  be 
converted into a PAC-learning algorithm.

In  some  sense,  the  bound  given  in  Theorem  6.1  on  the  size  of  the  output  hypothesis  implies 
the converse. In particular, suppose C is a learnable concept class and that we have been given
$m$ examples $(x_1, c(x_1)), (x_2, c(x_2)), \ldots, (x_m, c(x_m))$ where each $x_i \in X_n$ and $c$ is a concept in $C_n$
of size $s$. These examples need not have been chosen at random. The data compression problem
is to find a small representation for the data, i.e., a hypothesis $h$ that is significantly smaller
than the original data set with the property that $h(x_i) = c(x_i)$ for each $x_i$. A hypothesis with
this last property is said to be consistent with the sample.

Theorem 6.1 implies the existence of an efficient algorithm that outputs consistent hypotheses
of size only poly-logarithmic in the size $m$ of the sample. This is made precise by the following theorem:

Theorem 6.2  Let C be a learnable concept class. Then there exists an efficient algorithm that,
given $0 < \delta < 1$ and $m$ (distinct) examples of a concept $c \in C_n$ of size $s$, outputs with probability
at least $1 - \delta$ a deterministic hypothesis, consistent with the sample, of size polynomial in $n$, $s$
and $\log m$.

Proof: Pitt and Valiant [68] show how to convert any learning algorithm into one that finds
hypotheses consistent with a set of data points. The idea is to choose $\epsilon < 1/m$ and to run the
learning algorithm on a (simulated) uniform distribution over the data set. Since $\epsilon$ is less than
the weight placed on any element of the sample, the output hypothesis cannot misclassify even
a single data point. Applying this technique to a learning algorithm A satisfying the conditions
of Theorem 6.1, we see that the output hypothesis has size only polynomial in $n$, $s$ and $\log m$,
and so is far smaller than the original sample for large $m$.

Technically, this technique requires that the learning algorithm output deterministic
hypotheses. However, probabilistic hypotheses can also be handled by choosing a somewhat
smaller value for $\epsilon$, and by “hard-wiring” the computed probabilistic hypothesis with a
sequence of random bits. More precisely, set $\epsilon = 1/(2m)$, and run A over the same distribution as
before. Assume A has a good run. Note that the output hypothesis $h$ can be regarded as a
deterministic function of an instance $x$ and a sequence of random bits $r$. Let $p$ be the chance
that, for a randomly chosen sequence $r$, $h(\cdot, r)$ misclassifies one or more of the instances in
the sample. For such an $r$, the chance is certainly at least $1/m$ that an instance $x$ is chosen
(according to the simulated uniform distribution on the sample) for which $h(x, r) \ne c(x)$. Thus,
the error of $h$ is at least $p/m$. By our choice of $\epsilon$, this implies that $p \le 1/2$, or, in other words,
that the probability that a random sequence $r$ is chosen for which $h(\cdot, r)$ correctly classifies
all of the $m$ examples is at least 1/2. Thus, choosing and testing random sequences $r$, we can
quickly find one for which the deterministic hypothesis $h(\cdot, r)$ is consistent with the sample.
Finally, note that the size of this output hard-wired hypothesis is bounded by $|h| + |r|$, and
that $|r|$ is bounded by the time it takes to evaluate $h$, which is poly-logarithmic in $m$. ■
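
The two steps of this proof (running the learner on a simulated uniform distribution over the sample, then hard-wiring random bits) can be sketched in Python. Everything named here is an assumption for illustration: `find_consistent`, the `learner(draw, eps)` interface, the use of a 32-bit integer seed `r` in place of a bit sequence, and the retry limit `max_tries`.

```python
import random

def find_consistent(learner, sample, max_tries=64, seed=0):
    """sample: list of distinct (x, label) pairs.
    learner(draw, eps) returns a randomized hypothesis h(x, r), where the
    integer r stands in for the sequence of random bits (an assumption)."""
    m = len(sample)
    rng = random.Random(seed)
    # Simulate the uniform distribution over the data set.
    draw = lambda: rng.choice(sample)
    # eps = 1/(2m): on a good run, at least half of all r give a consistent h(., r).
    h = learner(draw, 1.0 / (2 * m))
    for _ in range(max_tries):
        r = rng.getrandbits(32)
        # Test this choice of random bits against every example.
        if all(h(x, r) == y for x, y in sample):
            # Hard-wire r to obtain a deterministic hypothesis.
            return lambda x, r=r: h(x, r)
    return None  # possible only if the learner did not have a good run
```

Since each trial succeeds with probability at least 1/2 on a good run, the expected number of seeds tested is at most two.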

Discrete  domains 

Naturally, the notion of size in the preceding theorem depends on the underlying model of
computation, which we have left unspecified. However, the theorem has some immediate corollaries
when the learning problem is discrete, i.e., when every instance in the domain $X_n$ is encoded
using a finite alphabet by a string of length bounded by a polynomial in $n$, and every concept in
C of size $s$ is also encoded using a finite alphabet by a string of length bounded by a polynomial
in $s$.

Corollary 6.3  Let C be a learnable discrete concept class. Then there exists an efficient
algorithm that, given $0 < \delta < 1$ and a sample as in Theorem 6.2, outputs with probability at least
$1 - \delta$ a deterministic consistent hypothesis of size polynomial in $n$ and $s$, and independent of $m$.

Proof: Since we assume (without loss of generality) that all the points of the sample are
distinct, the sample size $m$ cannot exceed $|X_n|$. Since $\log |X_n|$ is bounded by a polynomial in
$n$, the corollary follows immediately. ■

Applying “Occam’s Razor” of Blumer et al. [13], this implies the following strong general
bound on the sample size needed to efficiently learn C. Although the bound is better than that
given by Theorem 6.1 (at least in terms of $\epsilon$), it should be pointed out that this improvement
requires the sacrifice of space efficiency since the entire sample must be stored.

Theorem 6.4  Let C be a learnable discrete concept class. Then there exists an efficient learning
algorithm for C requiring a sample of size $O(\epsilon^{-1} \cdot (p(n, s) + \log(1/\delta)))$ for some polynomial $p$.

Proof:  Blumer  et  al.  [13]  describe  a  technique  for  converting  a  so-called  “Occam”  algorithm 
A  with  the  property  described  in  Corollary  6.3  into  an  efficient  learning  algorithm  with  the 
stated  sample  complexity  bound.  Essentially,  to  make  this  conversion,  one  simply  draws  a 
sample  of  the  stated  size  (choosing  p  appropriately),  and  runs  A  on  the  sample  to  find  a 
consistent  hypothesis.  The  authors  argue  that  the  computed  hypothesis,  simply  by  virtue 
of its small size and consistency with the sample, will be $\epsilon$-close to the target concept with
high  probability.  (Technically,  their  approach  needs  some  minor  modifications  to  handle,  for 
instance,  a  randomized  Occam  algorithm;  these  modifications  are  straightforward.)  ■ 
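
The shape of the bound in Theorem 6.4 can be illustrated with the standard sample-size bound for a finite hypothesis class: a hypothesis consistent with $m \ge \epsilon^{-1}(\ln|H| + \ln(1/\delta))$ random examples is $\epsilon$-close with probability at least $1 - \delta$. The sketch below assumes hypotheses are encoded as binary strings of at most `bits` bits (so $|H| < 2^{\text{bits}+1}$); the function name and the encoding are assumptions made here, and the constants of Blumer et al.'s actual argument may differ.

```python
import math

def occam_sample_size(bits, eps, delta):
    """Sample size sufficient for a consistent hypothesis drawn from a class
    of binary strings of length at most `bits` to be eps-close with
    probability at least 1 - delta (standard finite-class bound)."""
    # There are fewer than 2^(bits+1) binary strings of length at most `bits`.
    ln_class_size = (bits + 1) * math.log(2)
    return math.ceil((ln_class_size + math.log(1.0 / delta)) / eps)

# Example: hypotheses of at most 100 bits, eps = 0.1, delta = 0.05.
print(occam_sample_size(100, 0.1, 0.05))  # 731
```

Because the hypothesis size `bits` plays the role of $p(n, s)$ and does not grow with the sample, the bound is linear in $1/\epsilon$ with no additional logarithmic factor in $1/\epsilon$.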

Non-discrete  domains 

Littlestone  and  Warmuth  [59]  consider  the  relationship  between  learning  and  more  general  kinds 
of  data  compression  schemes  applicable  to  domains  that  are  not  necessarily  discrete.  Specifically, 
a  compression  scheme  is  an  algorithm  that  takes  as  input  a  sample  S  of  m  labeled  examples 
of  some  target  concept  c,  and  that  outputs  a  hypothesis  h  consistent  with  the  sample  S  and 
represented  over  the  alphabet  S.  In  other  words,  h  is  represented  by  a  sequence  of  examples 
from  the  sample  itself.  For  example,  if  the  domain  is  the  real  plane,  and  the  hypothesis  classifies 
points  as  positive  if  and  only  if  they  occur  inside  some  rectangle  determined  by  four  points  from 
the  sample,  then  h  is  naturally  represented  by  those  four  examples. 
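
The rectangle example can be written out as a small compression scheme. This is a sketch under the assumption that the target concept is an axis-parallel rectangle in the plane; the function names `compress` and `reconstruct` are hypothetical.

```python
def compress(sample):
    """sample: list of ((x, y), label). Returns a kernel of at most four
    positive examples: the extreme points in each coordinate direction."""
    pos = [p for p, label in sample if label]
    if not pos:
        return []
    return [min(pos, key=lambda p: p[0]), max(pos, key=lambda p: p[0]),
            min(pos, key=lambda p: p[1]), max(pos, key=lambda p: p[1])]

def reconstruct(kernel):
    """Rebuild the hypothesis from the kernel: the smallest axis-parallel
    rectangle containing the kernel points."""
    if not kernel:
        return lambda p: False
    xs = [p[0] for p in kernel]
    ys = [p[1] for p in kernel]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    return lambda p: lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y
```

The kernel size here is at most four regardless of the sample size $m$. The rebuilt rectangle is the bounding box of the positive examples, so it lies inside the target rectangle and therefore excludes every negative example.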

The kernel size of the hypothesis is just the length of the sequence of examples that
represents it. Also, it is often convenient to allow the hypothesis to incorporate “additional
information” (say, a sequence of bits providing some supplementary information about the
hypothesis). The size of the hypothesis is then its total length in symbols over the alphabet
$S \cup \{0, 1\}$; that is, its length is equal to its kernel size plus the length of any additional
information. As usual, we require that the compression algorithm A run in polynomial time, and
that the output hypothesis be polynomially evaluatable.

Littlestone and Warmuth [59] show that if there exists a compression scheme algorithm A
for a concept class C that outputs hypotheses significantly smaller than the sample (say, linear
in $m^\alpha$ for some constant $\alpha < 1$) then A can be used as a learning algorithm and C is learnable.

As a consequence of Theorem 6.2, we can show that the converse holds as well: if C is
learnable, then there exists a compression scheme for C outputting hypotheses of size polynomial
in $n$, $s$ and $\log m$.

Theorem 6.5  Let C be a learnable concept class. Then there exists a compression scheme
algorithm for C that, given a sample S of $m$ examples of some size-$s$ concept $c \in C_n$, outputs
a hypothesis $h$ consistent with S, and represented as a string in $\{0,1\}^k \times S^l$ for some $k$ and $l$
polynomial in $n$, $s$ and $\log m$.

Proof: Let A be a weak learning algorithm for C. The key point in this proof is that hypotheses
output  by  A  can  be  trivially  represented  by  the  entire  set  of  examples  actually  received  by  the 
algorithm:  a  hypothesis  represented  in  this  manner  can  be  efficiently  evaluated  by  re-running 
A  on  the  sample  (given  by  the  hypothesis)  producing  the  “true”  hypothesis  of  A  which  can 
then  be  evaluated  on  a  given  instance.  (If  A  is  randomized,  we  also  include  the  random  bits 
that  were  used.)  Note  that  such  a  hypothesis  has  kernel  size  equal  to  the  (polynomial)  sample 
complexity  of  A. 

Next,  we  apply  the  hypothesis-boosting  construction  of  Sections  2-3  and  2-4,  and  we  then 
eliminate  any  dependence  on  6  in  the  size  of  the  output  hypothesis  using  the  technique  described 
in  the  proof  of  Theorem  6.1.  The  result  is  a  strong  learning  algorithm  for  C  that  outputs 
hypotheses with kernel size only polynomial in $n$, $s$ and $\log(1/\epsilon)$. Specifically, the hypothesis
is  represented  by  the  examples  used  on  each  simulated  execution  of  A,  in  addition  to  a  bit 
string  describing  the  overall  structure  of  the  hypothesis  constructed  by  the  boosting  procedure. 
Finally,  this  hypothesis  (which  may  be  randomized)  is  converted  into  a  compression  scheme  as 
described  in  Theorem  6.2.  ■ 

Thus,  a  necessary  and  sufficient  condition  for  learnability  is  the  existence  of  a  compression 
scheme  of  the  style  described  by  Littlestone  and  Warmuth  [59]. 

It  is  worth  pointing  out  that  the  technique  described  in  Theorem  6.5,  combined  with  the 
results  of  Littlestone  and  Warmuth,  gives  an  alternative  method  for  analyzing  the  sample 
complexity  of  the  hypothesis  boosting  procedure  of  Section  2-3.  Specifically,  this  proof  shows 
that  Learn  can  be  used  as  a  compression  scheme  with  output  hypothesis  size  polynomial  in 
$n$, $s$ and $\log m$. Littlestone and Warmuth show that such a compression scheme can in turn
be converted into a learning algorithm with sample size $\epsilon^{-1} \cdot p(n, s, \log(1/\epsilon), \log(1/\delta))$ for some
polynomial $p$.

Furthermore,  notice  that  when  Learn  is  used  as  a  data  compression  scheme  (using  the 
technique  described  in  Theorem  6.5)  the  target  distribution  is  entirely  known,  both  at  the  top 
level  where  it  is  uniform  on  the  sample,  and  (with  some  simple  modifications)  at  each  lower 
level as well. Thus, in such a case, there is no need for hypothesis testing since the error of any
hypothesis  can  be  computed  directly  by  evaluating  it  on  each  instance  and  using  our  knowledge 
about  the  target  distribution.  Also,  there  is  no  longer  a  need  to  filter  any  distributions  —  the 
given  distribution  can  be  directly  simulated.  For  these  reasons,  the  algorithm  can  be  shown  to 
run much faster; in fact, its running time is comparable to the sample size stated above.

Thus, Littlestone and Warmuth’s techniques can be used to modify Learn, yielding time
and sample complexity bounds that, like those of Learn', are linear in $1/\epsilon$ (ignoring log factors).
However,  in  contrast  to  Learn',  the  resulting  procedure  is  not  space  efficient  since  the  entire 
sample  must  be  stored. 

2-6.3  Hard  functions  are  hard  to  learn 

Theorem  6.1’s  bound  on  the  size  of  the  output  hypothesis  also  implies  that  any  hard-to-evaluate 
concept  class  is  unlearnable.  Although  this  result  does  not  sound  surprising,  it  was  previously 
unclear how it might be proved: since a learning algorithm’s hypotheses are technically
permitted to grow polynomially in $1/\epsilon$, the learnability of such classes did not seem out of the
question.

This result yields the first representation-independent hardness results not based on
cryptographic assumptions. For instance, assuming P/poly $\ne$ NP/poly, the class of polynomial-size,
nondeterministic  Boolean  circuits  is  not  learnable.  (The  set  P/poly  (NP/poly)  consists  of  those 
languages  accepted  by  a  family  of  polynomial-size  deterministic  (nondeterministic)  circuits.) 
Furthermore,  since  learning  pattern  languages  was  recently  shown  [77]  to  be  as  hard  as  learning 
NP/poly,  this  result  shows  that  pattern  languages  are  also  unlearnable  under  this  relatively 
weak  complexity-theoretic  assumption. 

Theorem 6.6  Suppose C is learnable, and assume that $X_n = \{0,1\}^n$. Then there exists a
polynomial $p$ such that for all concepts $c \in C_n$ of size $s$, there exists a circuit of size $p(n, s)$
exactly computing $c$.

Proof: Consider the set of $2^n$ pairs $\{(x, c(x)) : x \in X_n\}$. By Corollary 6.3, there exists an
algorithm that, with positive probability, will output a hypothesis of size only polynomial in
$n$ and $s$ that is consistent with this set of elements. Since this hypothesis is polynomially evaluatable,
it can be converted using standard techniques into a circuit of the required size. ■

2-6.4  Hard  functions  are  hard  to  approximate 

By  a  similar  argument,  the  bound  on  hypothesis  size  implies  that  any  function  not  computable 
by  small  circuits  cannot  even  be  weakly  approximated  by  a  family  of  small  circuits,  for  some 
distribution  on  the  inputs. 

Let $f$ be a Boolean function on $\{0,1\}^*$, $D$ a distribution on $\{0,1\}^n$ and $C$ a circuit on $n$
variables. Then $C$ is said to $\beta$-approximate $f$ under $D$ if the probability is at most $\beta$ that
$C(x) \ne f(x)$ on an assignment $x$ chosen randomly from $\{0,1\}^n$ according to $D$.

Theorem 6.7  Suppose some function $f$ cannot be computed by any family of polynomial-size
circuits. Then there exists a family of distributions $D_1, D_2, \ldots$, where $D_n$ is over the set $\{0,1\}^n$,
such that for all polynomials $p$ and $q$, there exist infinitely many $n$ for which there exists no
$n$-variable circuit of size at most $q(n)$ that $(\frac{1}{2} - \frac{1}{p(n)})$-approximates $f$ under $D_n$.

Proof:  Throughout  this  proof,  we  will  assume  without  loss  of  generality  that  p(n )  —  q(n)  =  nk 
for  some  integer  k  >  1. 

Suppose first that there exists some k such that for all n and every distribution D on $\{0,1\}^n$,
there exists a circuit of size at most $n^k$ that $(\frac{1}{2} - \frac{1}{n^k})$-approximates f under D. Then f can, in a
sense, be weakly learned. More precisely, there exists an (exponential-time) procedure that, by
searching exhaustively the set of all circuits of size $n^k$, will find one that $(\frac{1}{2} - \frac{1}{n^k})$-approximates
f under some given distribution D. Therefore, by Theorem 3.1, f is strongly learnable in a
similar sense in exponential time. Applying Theorem 6.6 (whose validity depends only on the
size of the output hypothesis, and not on the running time), this implies that f can be exactly
computed by a family of polynomial-size circuits, contradicting the theorem’s hypothesis.

Thus, for all $k \ge 1$, there exists an integer n and a distribution D on $\{0,1\}^n$ such that no
circuit of size at most $n^k$ is able to $(\frac{1}{2} - \frac{1}{n^k})$-approximate f under D. To complete the proof, it
suffices to show that this implies the theorem’s conclusion.

Let $\mathcal{D}_n^k$ be the set of distributions D on $\{0,1\}^n$ for which no circuit of size $n^k$ or smaller
$(\frac{1}{2} - \frac{1}{n^k})$-approximates f under D. It is easy to verify that $\mathcal{D}_n^k \supseteq \mathcal{D}_n^{k+1}$ for all k, n. Also,
since every function can be computed by exponential size circuits, there must exist a constant
$c > 0$ for which $\mathcal{D}_n^{cn} = \emptyset$ for all n. Let $n[k]$ be the smallest n for which $\mathcal{D}_n^k \ne \emptyset$. By the
preceding argument, $n[k]$ must exist. Furthermore, $n[k] > k/c$, which implies that the set
$N = \{n[k] : k \ge 1\}$ cannot have finite cardinality.

To eliminate repeated elements from N, let $k_1 < k_2 < \cdots$ be such that $n[k_i] \ne n[k_j]$ for
$i \ne j$, and such that $\{n[k_i] : i \ge 1\} = N$. Let $D_i$ be defined as follows: if $i = n[k_j]$ for some
j, then let $D_i$ be any distribution in $\mathcal{D}_i^{k_j}$ (which cannot be empty by our definition of $n[k]$);
otherwise, if $i \notin N$, then define $D_i$ arbitrarily. Then $D_1, D_2, \ldots$ is the desired family of “hard”
distributions. For if k is any integer, then for all $k_i \ge k$, $D_{n[k_i]} \in \mathcal{D}_{n[k_i]}^{k_i} \subseteq \mathcal{D}_{n[k_i]}^{k}$. This proves
the theorem. ■

Informally,  Theorem  6.7  states  that  any  language  not  in  the  complexity  class  P/poly  cannot 
be  even  weakly  approximated  by  any  other  language  in  P/poly  under  some  “hard”  family  of 
distributions.  In  fact,  the  theorem  can  easily  be  modified  to  apply  to  other  circuit  classes  as 
well, including monotone P/poly, and monotone or non-monotone $NC^k$ for fixed k. (The class
$NC^k$ consists of all languages accepted by polynomial-size circuits of depth at most $O(\log^k n)$,
and  a  monotone  circuit  is  one  in  which  no  negated  variables  appear.)  In  general  the  theorem 
applies  to  all  circuit  classes  closed  under  the  transformation  on  hypotheses  resulting  from  the 
construction of Sections 2-3 and 2-4.

2-6.5 On-line learning

Finally,  we  consider  implications  of  Theorem  6.1  for  on-line  learning  algorithms.  In  the  on-line 
learning  model,  the  learner  is  presented  one  (randomly  selected)  instance  at  a  time  in  a  series 
of  trials.  Before  being  told  its  correct  classification,  the  learner  must  try  to  predict  whether  the 
instance  is  a  positive  or  negative  example.  An  incorrect  prediction  is  called  a  mistake.  In  this 
model,  the  learner’s  goal  is  to  minimize  the  number  of  mistakes. 

Previously,  Haussler,  Littlestone  and  Warmuth  [39]  have  shown  that  a  concept  class  C  is 
learnable  if  and  only  if  there  exists  an  on-line  learning  algorithm  for  C  with  the  properties  that: 

• the probability of a mistake on the mth trial is at worst linear in $m^{-\beta}$ for some constant
$0 < \beta < 1$, and (equivalently)

• the expected number of mistakes on the first m trials is at worst linear in $m^{\alpha}$ for some
constant $0 < \alpha < 1$.

(This result is also described in their paper with Kearns [38].) Noting several examples of
learning algorithms for which this second bound only grows poly-logarithmically in m, the authors
ask if every learnable concept class has an algorithm attaining such a bound. Theorem 6.8
below answers this open question affirmatively, showing that in general the expected number
of mistakes on the first m trials need only grow as a polynomial in log m. Thus, we expect only
a minute fraction of the first m predictions to be incorrect.

(This result should not be confused with those presented in another paper by Haussler,
Littlestone and Warmuth [40]. In that paper, the authors describe a general algorithm applicable
to a wide collection of concept classes, and they show that the expected number of mistakes
made by this algorithm on the first m trials is linear in log m. However, their algorithm
requires exponential computation time, even if it is known that the concept class is learnable. In
contrast, Theorem 6.8 states that, if a concept class is learnable, then there exists an efficient
algorithm whose expected number of mistakes on the first m trials is poly-logarithmic in m.)

Haussler,  Littlestone  and  Warmuth  [39]  also  consider  the  space  efficiency  of  on-line  learning 
algorithms.  They  define  a  space-efficient  learning  algorithm  to  be  one  whose  space  requirements 
on the first m trials do not exceed a polynomial in n, s and log m. Thus, a space-efficient
algorithm is one using far less memory than would be required to store explicitly all of the
preceding observations. The authors describe a number of space-efficient algorithms (though
they are unable to find one for learning unions of axis-parallel rectangles in the plane), and so
are  led  to  ask  whether  there  exist  space-efficient  algorithms  for  all  learnable  concept  classes. 
Surprisingly,  this  open  question  can  also  be  answered  affirmatively,  as  proved  by  the  theorem 
below.

Lastly, Theorem 6.8 gives a bound on the computational complexity of on-line learning (in
terms of m). In particular, the total computation time required to process the first m examples
is only proportional to $m \log^c m$, for some constant c. Thus, in a sense, the “amortized” or
“average” computation time on the mth trial is only poly-logarithmic in m. (In fact, a more
careful analysis would show that this is also true of the worst-case computation time on the
mth trial.)

Theorem 6.8 Let C be a learnable concept class. Then there exists an efficient on-line learning
algorithm for C with the properties that:

• the probability of a mistake on the mth trial is at most $m^{-1} \cdot p_1(n, s, \log m)$,

• the expected number of mistakes on the first m trials is at most $p_2(n, s, \log m)$,

• the total computation time required on the first m trials is at most $m \cdot p_3(n, s, \log m)$, and

• the space used on the first m trials is at most $p_4(n, s, \log m)$,
for some polynomials $p_1$, $p_2$, $p_3$, $p_4$.

Proof: Since C is learnable, there exists an efficient (batch) algorithm satisfying the properties
of Theorem 6.1. Let A be such an algorithm, but with $\epsilon/2$ substituted for both $\epsilon$ and $\delta$. Then
the chance that A’s output hypothesis incorrectly classifies a randomly chosen instance is at
most $\epsilon$: with probability at most $\epsilon/2$ the hypothesis is a poor one, and otherwise it errs with
probability at most $\epsilon/2$. (This technique is also used by Haussler et al. [38].)

Fix n and s, and let $m(\epsilon)$ be the number of examples needed by A. From Theorem 6.1,
$m(\epsilon) \le (p/\epsilon) \cdot \lg^c(1/\epsilon)$ for some constant c and some value p implicitly bounded by a polynomial
in n and s. Let $\epsilon(m) = (p/m) \cdot \lg^c(m/p)$. Then it can be verified that $m(\epsilon(m)) \le m$ for $m \ge 2p$.
Thus, m examples suffice to find a hypothesis whose chance of error is at most $\epsilon(m)$.
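
The claim that the sample bound evaluated at $\epsilon(m)$ is at most m can be spot-checked numerically. The sketch below treats the upper bound $(p/\epsilon) \lg^c(1/\epsilon)$ as if it were the sample size itself, picks arbitrary illustrative values p = 10 and c = 2, and only tests values of m for which $\epsilon(m) < 1$ (an error bound of 1 or more is vacuous); the function names are assumptions made for this example.

```python
import math

def m_of_eps(eps, p, c):
    """The bound (p/eps) * lg^c(1/eps) on the number of examples."""
    return (p / eps) * math.log2(1.0 / eps) ** c

def eps_of_m(m, p, c):
    """The accuracy eps(m) = (p/m) * lg^c(m/p) chosen in the proof."""
    return (p / m) * math.log2(m / p) ** c

def check(p, c, m_values):
    """Return every m with eps(m) < 1 for which m_of_eps(eps(m)) exceeds m."""
    failures = []
    for m in m_values:
        eps = eps_of_m(m, p, c)
        # Small relative tolerance absorbs floating-point rounding at equality.
        if eps < 1.0 and m_of_eps(eps, p, c) > m * (1 + 1e-9):
            failures.append(m)
    return failures

failures = check(p=10, c=2, m_values=range(20, 100001, 7))
```

The reason the check succeeds is that $\lg(m/p) \ge 1$ when $m \ge 2p$, so $1/\epsilon(m) \le m/p$ and the logarithmic factors cancel in the right direction.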

To convert A into an on-line learning algorithm in a manner that preserves time and space
efficiency, imagine breaking the sequence of trials into blocks of increasing size: the first block
consists of the first 2p trials, and each new block has twice the size of the last. Thus, in general,
the ith block has size $s_i = 2^i p$, and consists of trials $a_i = 2(2^{i-1} - 1)p + 1$ through $b_i = 2(2^i - 1)p$.

On the trials of the ith block, algorithm A is simulated to compute the ith hypothesis
$h_i$. Specifically, A is simulated with $\epsilon$ set to $\epsilon(s_i)$, which thus bounds the probability that $h_i$
misclassifies a new instance. (Note that there are enough instances available in this block for A
to compute a hypothesis of the desired accuracy.) On the next block, as the (i + 1)st hypothesis
is being computed, $h_i$ is used to make predictions; at the end of this block, $h_i$ is discarded as
$h_{i+1}$ takes its place.

Thus, if the mth trial occurs in the ith block (i.e., if $a_i \le m \le b_i$), then the probability of a
mistake is bounded by $\epsilon(s_{i-1})$, the error rate of $h_{i-1}$. From the definition of $\epsilon(\cdot)$, this implies the
desired bound on the probability of a mistake on the mth trial, and, in turn, on the expected
number of mistakes on the first m trials.

Finally, note that on the ith block, space is needed only to store the hypothesis from the
last block $h_{i-1}$, and to simulate A’s computation of block i’s hypothesis. By Theorem 6.1, both
of these quantities grow polynomially in $\log(1/\epsilon)$. By our choice of $\epsilon$, this implies the desired
bound on the algorithm’s space efficiency. The time complexity of the procedure is bounded in
a similar fashion. ■
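
The block-doubling schedule used in this proof can be sketched as follows. This is an illustration only: `batch_learner` stands in for the algorithm A, `majority_label_learner` is a hypothetical toy learner, and for simplicity the sketch buffers each block’s examples before calling the learner, so it shows the schedule but does not itself achieve the space bound of the theorem (which relies on A processing examples space-efficiently).

```python
def block_bounds(i, p):
    """First and last trial (a_i, b_i) of the i-th block, for i >= 1."""
    a = 2 * (2 ** (i - 1) - 1) * p + 1
    b = 2 * (2 ** i - 1) * p
    return a, b

def online_from_batch(batch_learner, stream, p, default=0):
    """Yield one prediction per (instance, label) pair in `stream`.

    During block i the hypothesis learned on block i-1 makes the predictions,
    while the examples of block i are collected for the next hypothesis.
    """
    current = lambda x: default        # no hypothesis exists during block 1
    buffer, i = [], 1
    for m, (x, label) in enumerate(stream, start=1):
        yield current(x)               # predict before the label is used
        buffer.append((x, label))
        if m == block_bounds(i, p)[1]: # last trial of block i
            current = batch_learner(buffer)
            buffer, i = [], i + 1

def majority_label_learner(examples):
    """Toy batch learner: always predict the most common label seen."""
    ones = sum(label for _, label in examples)
    guess = 1 if 2 * ones >= len(examples) else 0
    return lambda x: guess

predictions = list(online_from_batch(majority_label_learner, [(0, 1)] * 6, p=1))
```

With p = 1 the first block is trials 1 and 2, so the default prediction is used twice before the first learned hypothesis takes over.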

2-7  Conclusions  and  open  problems 

We have shown that a model of learnability in which the learner is only required to perform
slightly  better  than  guessing  is  as  strong  as  a  model  in  which  the  learner’s  error  can  be  made 
arbitrarily  small.  The  proof  of  this  result  was  based  on  the  filtering  of  the  distribution  in  a 
manner  causing  the  weak  learning  algorithm  to  eventually  learn  nearly  the  entire  distribution. 
We  have  also  shown  this  proof  implies  a  set  of  general  bounds  on  the  complexity  of  PAC-learning 
(both  batch  and  on-line),  and  have  discussed  some  of  the  applications  of  these  bounds. 

It is hoped that these results will open the way to a new method of algorithm design for
PAC-learning.  As  previously  mentioned,  the  vast  majority  of  currently  known  algorithms  work 
by  finding  a  hypothesis  consistent  with  a  large  sample.  An  alternative  approach  suggested  by 
the  main  result  is  to  seek  instead  a  hypothesis  that  works  correctly  on  slightly  more  than  half 
the  distribution.  Perhaps,  such  a  hypothesis  is  easier  to  find,  at  least  from  the  point  of  view 
of  the  algorithm  designer.  This  approach  leads  to  algorithms  with  a  flavor  similar  to  the  one 
described for k-term DNF in Section 2-5.3. To what extent will this approach be fruitful for
other  classes  not  presently  known  to  be  learnable? 

Another  open  question  concerns  the  robustness  of  the  construction  described  in  this  chapter. 
Intuitively,  it  seems  that  there  should  be  a  close  relationship  between  reducing  the  error  of  the 
hypothesis,  and  overcoming  noise  in  the  data.  Is  this  a  valid  intuition?  Can  our  construction  be 
modified  to  handle  noise?  Can  the  construction  be  extended  to  the  p-concept  model  described 
in  Chapter  4? 

Finally,  turning  away  from  the  theoretical  side  of  machine  learning,  we  can  ask  how  well 
our  construction  would  perform  in  practice.  Often,  a  learning  program  (for  instance,  a  neural 
network)  is  designed,  implemented,  and  found  empirically  to  achieve  a  “good”  error  rate,  but 
no  way  is  seen  of  improving  the  program  further  to  enable  it  to  achieve  a  “great”  error  rate. 
Suppose  our  construction  is  implemented  on  top  of  this  learning  program.  Would  it  help?  This 
is  not  a  theoretical  question,  but  one  that  can  only  be  answered  experimentally,  and  one  that 
obviously  depends  on  the  domain  and  the  underlying  learning  program.  Nevertheless,  it  seems 
plausible that the construction might in some cases give good results in practice.

Statistical-perturbation Methods for
Inference  of  Read-once  Formulas 


3-1  Introduction 

This  chapter  explores  in  detail  a  simple  but  powerful  statistical  technique  for  discovering  the 
structure  of  a  read-once  formula.  (A  formula  is  read-once  if  each  variable  appears  at  most  once 
in  the  formula.)  As  a  demonstration  of  its  power,  we  apply  this  technique  to  an  array  of  learning 
problems;  in  each  case,  we  obtain  the  first  provably  efficient  algorithm  that  effectively  solves 
the  given  learning  problem.  We  also  demonstrate  that  our  method  is  highly  robust  against  a 
great  deal  of  noise  and  randomness. 

Similar  to  the  Valiant  model  [83],  we  consider  the  problem  of  learning  read-once  formulas 
from  randomly  chosen  examples.  The  basic  idea  of  our  method  is  to  observe  the  statistical 
behavior  of  the  target  formula’s  output  under  various  simple  and  easily  sampled  perturbations 
of  the  target  distribution  (the  distribution  under  which  random  examples  are  chosen).  For 
example,  a  typical  perturbation  might  “hard-wire”  a  single  variable  to  some  fixed  value.  In  a 
variety  of  situations,  we  demonstrate  that  this  simple  technique  can  be  applied  to  effectively 
discover  much  or  all  of  the  target  formula’s  structure. 

For example, using this method, we are able to derive efficient algorithms for exactly
identifying certain classes of read-once Boolean formulas when the observed examples are chosen
randomly  according  to  specific,  fixed  and  simple  distributions.  Even  when  the  formula’s  output 
is  corrupted  by  a  great  deal  of  random  misclassification  noise,  we  show  that  exact  identification 
can  be  achieved. 

We  also  apply  our  method  to  a  probabilistic  generalization  of  the  class  of  all  read-once 
Boolean formulas constructed from the usual basis {AND, OR, NOT}. We show that an arbitrarily
good approximation of such formulas can be inferred in polynomial time against any product
distribution  (i.e.,  any  distribution  in  which  the  setting  of  each  variable  is  chosen  independently 
of  the  settings  of  the  other  variables).  For  example,  this  shows  that  the  class  of  read-once 
Boolean  formulas  over  the  usual  basis  can  be  learned  in  polynomial  time  against  the  uniform 
distribution  in  the  sense  of  Valiant. 

The  problem  of  learning  Boolean  formulas  against  special  distributions  has  been  considered 
by  a  number  of  other  authors.  In  particular,  our  technique  closely  resembles  that  used  by 
Kearns  et  al.  [51]  for  learning  the  class  of  read-once  formulas  in  disjunctive  normal  form  (DNF) 
against  the  uniform  distribution.  A  similar  result,  though  based  on  a  different  method,  was 
obtained  by  Pagallo  and  Haussler  [66].  Our  results  extend  theirs  to  a  much  broader  class  of 
read-once  formulas. 

Also,  Linial,  Mansour  and  Nisan  [57]  used  a  technique  based  on  Fourier  spectra  to  learn 
the  class  of  constant-depth  formulas  (constructed  from  gates  of  unbounded  fan-in)  against  the 
uniform  distribution.  Furst,  Jackson  and  Smith  [27]  generalized  this  result  to  learn  this  same 
class  against  any  product  distribution.  Verbeurgt  [86]  gives  a  different  algorithm  for  learning 
DNF-formulas  against  the  uniform  distribution.  However,  all  three  of  these  algorithms  require 
quasi-polynomial ($n^{\mathrm{polylog}(n)}$) time, though Verbeurgt’s procedure only requires a
polynomial-size sample.

Exact  identification  using  amplification  functions 

As  mentioned  above,  this  chapter  includes  efficient  algorithms  for  exactly  identifying  certain 
classes  of  read-once  Boolean  formulas  by  observing  the  target  formula’s  behavior  on  examples 
drawn  randomly  according  to  a  fixed  and  simple  distribution.  This  distribution  is  related  to  the 
formula’s amplification function. The amplification function $A_f(p)$ for a function $f : \{0,1\}^n \to \{0,1\}$
is defined as the probability that the output of f is 1 when each of the n inputs to f is
1  independently  with  probability  p.  Amplification  functions  were  first  studied  by  Valiant  [83] 
and  Boppana  [16,  17]  in  obtaining  bounds  on  monotone  formula  size  for  the  majority  function. 

The  method  used  by  our  algorithms  is  of  central  interest.  For  several  classes  of  formulas, 
we  show  that  the  behavior  of  the  amplification  function  is  unstable  near  the  fixed  point;  that 
is, the value of $A_f(p)$ varies greatly with a small change in p. This in turn implies that small
but easily sampled perturbations of the fixed-point distribution (that is, the distribution where
each input is 1 with probability p, where $A_f(p) = p$) reveal structural information about the
formula.  As  mentioned  above,  a  typical  perturbation  of  the  fixed-point  distribution  hard-wires 
a  single  variable  to  1  and  sets  the  remaining  variables  to  1  with  probability  p. 
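
To make the effect of such a perturbation concrete, the amplification function of a small formula can be computed by brute force, with and without a hard-wired variable. This is a minimal sketch, not the chapter’s algorithm: the function `amplification`, its `fixed` parameter and the depth-2 example formula are all assumptions introduced for illustration.

```python
from itertools import product

def amplification(f, n, p, fixed=None):
    """Pr[f(X) = 1] when each free X_i is 1 independently with probability p.

    `fixed` maps variable indices to hard-wired values (the perturbation).
    """
    fixed = fixed or {}
    total = 0.0
    for x in product((0, 1), repeat=n):
        if any(x[i] != v for i, v in fixed.items()):
            continue                   # assignment contradicts the hard-wiring
        weight = 1.0
        for i, bit in enumerate(x):
            if i not in fixed:
                weight *= p if bit else 1.0 - p
        if f(x):
            total += weight
    return total

def maj3(a, b, c):
    return int(a + b + c >= 2)

def formula(x):
    """A depth-2 read-once majority formula on 9 variables."""
    return maj3(maj3(*x[0:3]), maj3(*x[3:6]), maj3(*x[6:9]))

unperturbed = amplification(formula, 9, 0.5)
perturbed = amplification(formula, 9, 0.5, fixed={0: 1})
```

At the fixed point p = 1/2 the unperturbed value is 1/2, while hard-wiring one depth-2 variable to 1 moves it to 5/8; the size of this shift depends on the variable’s position in the formula, which is the kind of structural information the text refers to.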

We  apply  this  method  to  obtain  efficient  algorithms  for  exact  identification  of  classes  of 
read-once formulas over various bases. These include the class of logarithmic-depth read-once
formulas constructed with NOT gates and three-input majority gates (for which the fixed-point
distribution is the uniform distribution), as well as the class of logarithmic-depth read-once
formulas constructed with NAND gates (for which the fixed-point distribution assigns 1 to each
input independently with probability $1/\phi \approx 0.618$, where $\phi = (1 + \sqrt{5})/2$ is the golden ratio).
Thus,  for  these  classes,  since  the  fixed  point  of  the  amplification  function  is  the  same  for  all 
formulas,  we  obtain  a  single  simple  distribution  for  the  entire  class.  As  proved  by  Kearns 
and  Valiant  [52,  49],  these  same  classes  of  formulas  cannot  be  even  weakly  approximated  in 
polynomial  time  when  no  restriction  is  placed  on  the  target  distribution;  thus,  our  results  may  be 
interpreted  as  demonstrating  that  while  there  are  some  distributions  which  in  a  computationally 
bounded  setting  reveal  essentially  no  information  about  the  target  formula,  there  are  natural 
and  simple  distributions  which  reveal  all  information. 
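
As a worked illustration of the two fixed points quoted above, the amplification functions of a single gate can be written out directly. These are standard calculations rather than the chapter’s own derivation, and the symbols $A_{\mathrm{MAJ}}$ and $A_{\mathrm{NAND}}$ are introduced here only for this example.

```latex
\begin{align*}
A_{\mathrm{MAJ}}(p) &= p^3 + 3p^2(1-p) = 3p^2 - 2p^3,
  & A_{\mathrm{MAJ}}(p) = p &\iff p(2p-1)(p-1) = 0 \iff p \in \{0, \tfrac{1}{2}, 1\}, \\
A_{\mathrm{NAND}}(p) &= 1 - p^2,
  & A_{\mathrm{NAND}}(p) = p &\iff p^2 + p - 1 = 0 \iff p = \tfrac{\sqrt{5}-1}{2} = \tfrac{1}{\phi} \quad (p \ge 0), \\
A_{\mathrm{MAJ}}'(\tfrac{1}{2}) &= \bigl(6p - 6p^2\bigr)\big|_{p = 1/2} = \tfrac{3}{2},
  & A_{\mathrm{NAND}}'(\tfrac{1}{\phi}) &= -\tfrac{2}{\phi} \approx -1.236.
\end{align*}
```

Because each variable appears at most once, the inputs to every gate are independent, so if every input is 1 with the fixed-point probability then so is the output of every gate (a NOT gate maps 1/2 to 1/2 as well). Both derivatives have magnitude greater than 1 at the fixed point, which is one way to see why small changes in p are magnified there.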

For  Boolean  read-once  formulas  (a  superset  of  the  class  of  formulas  constructed  from  NAND 
gates)  there  is  an  efficient,  exact-identification  algorithm  using  membership  and  equivalence 
queries  due  to  Angluin,  Hellerstein  and  Karpinski  [8,  41].  The  class  of  read-once  majority 
formulas  can  also  be  exactly  identified  using  membership  and  equivalence  queries,  as  proved 
by  Hancock  [33]  and  Hellerstein  and  Karpinski  [42].  Briefly,  in  the  query  model,  the  learner 
attempts to infer the target formula by asking questions, or queries, of a “teacher.” For instance,
the learner might ask the teacher what the formula’s output would be for a specific assignment
to the input variables; this is called a membership query. On an equivalence query, the learner
asks if a given conjectured formula is equivalent to the target formula.

Note  that  our  algorithms’  use  of  a  fixed  distribution  can  be  regarded  as  a  form  of  “random” 
membership  queries,  since  this  fixed  and  known  distribution  can  be  easily  simulated  by  making 
random  membership  queries.  Thus,  our  algorithms  are  the  first  efficient  procedures  for  exact 
identification  of  logarithmic-depth  majority  and  NAND  formulas  using  only  membership  queries. 
Furthermore,  the  queries  used  are  non-adaptive  in  the  sense  that  they  do  not  depend  upon  the 
answers  received  to  previous  queries.  In  contrast,  all  previous  algorithms  for  exact  identification, 
including  the  algorithms  mentioned  above,  require  highly  adaptive  queries. 

We also prove that our algorithms are robust against a large amount of random
misclassification noise, similar to, but slightly more general than that considered by Sloan [81] and
Angluin and Laird [9]. Specifically, if $\eta_0$ and $\eta_1$ represent the respective probabilities that an
output of 0 or 1 is misclassified, then a robust version of our algorithm can handle any noise
rate for which $\eta_0 + \eta_1 \ne 1$; the sample size and computation time required increase only by an
inverse quadratic factor in $|1 - \eta_0 - \eta_1|$. Again regarding our algorithms as using “random”
membership  queries,  these  are  the  first  efficient  procedures  performing  exact  identification  in 
some  reasonable  model  of  noisy  queries.  Our  algorithms  can  also  tolerate  a  modest  rate  of 
malicious  noise. 
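
One way to see where the condition on the noise rates and the inverse quadratic factor come from is the following back-of-the-envelope calculation. It is a sketch for intuition only, not the chapter’s proof; q denotes the noise-free probability that the formula outputs 1 under some sampled distribution, and $\gamma$ denotes a gap between two such probabilities (both symbols are introduced here).

```latex
\begin{align*}
\Pr[\text{observed label} = 1]
  &= (1 - \eta_1)\, q + \eta_0\, (1 - q)
   = \eta_0 + (1 - \eta_0 - \eta_1)\, q, \\
\text{so a gap } \gamma \text{ between noise-free probabilities}
  &\text{ becomes a gap } |1 - \eta_0 - \eta_1|\, \gamma \text{ between observed ones.}
\end{align*}
```

When $\eta_0 + \eta_1 = 1$ the observed label carries no information about q. Otherwise, by standard Hoeffding-style bounds, the number of samples needed to resolve a gap grows as the inverse square of the gap, which is consistent with an extra factor of $1/(1 - \eta_0 - \eta_1)^2$.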




Finally, we present an algorithm that learns any (not necessarily logarithmic-depth)
read-once majority formula in Valiant’s model against the uniform distribution. To obtain this result
we first show that the target formula can be well approximated by truncating the formula to
have only logarithmic depth. We then generalize our algorithm for learning logarithmic-depth
read-once formulas to handle such truncated formulas. A similar result also holds for read-once
NAND formulas of unbounded depth.

Probabilistic  read-once  formulas 

In  Section  3-7,  we  describe  an  algorithm  for  learning  a  probabilistic  generalization  of  the  class 
of read-once formulas over the usual basis {AND, OR, NOT}.

We  adopt  from  Chapter  4  the  notion  of  a  probabilistic  concept  (p-concept).  A  p-concept  c  is 
a  function  which  maps  each  input-variable  assignment  x  to  a  real  number  c(x)  between  0  and  1. 
We interpret c(x) as the probability that instance x will be positively classified. Thus, in the
p-concept model, a randomly labeled example is chosen as follows: first, an instance x is chosen
at random according to the target distribution on the instance space; then, with probability
c(x), the labeled example (x, 1) is observed, and with probability $1 - c(x)$, the labeled example
(x, 0) is observed. Thus, in general, the learner has no direct access to the function c, even on
individual  points. 

We  view  the  learning  problem  as  that  of  inferring  from  such  randomly  chosen  examples  a 
good  approximation  of  the  function  c  itself.  Thus,  we  ask  that  the  learner  infer  a  real-valued 
hypothesis h for which $|h(x) - c(x)|$ is small for most instances x. This is called learning with
a  model  of  probability. 
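
The example-generation process for p-concepts described above can be sketched in a few lines. Everything named here (`draw_labeled_example`, `uniform_instance`, `toy_concept`) is a hypothetical illustration; the toy p-concept depends only on the first variable.

```python
import random

def draw_labeled_example(c, draw_instance, rng):
    """Draw x from the target distribution, then label it 1 with probability c(x)."""
    x = draw_instance(rng)
    label = 1 if rng.random() < c(x) else 0
    return x, label

def uniform_instance(rng, n=2):
    """Uniform distribution on {0,1}^n (a product distribution)."""
    return tuple(rng.randint(0, 1) for _ in range(n))

def toy_concept(x):
    """Toy p-concept: c(x) = 0.2 if x[0] = 0, and 0.8 if x[0] = 1."""
    return 0.2 + 0.6 * x[0]

rng = random.Random(0)
sample = [draw_labeled_example(toy_concept, uniform_instance, rng) for _ in range(20000)]
labels_when_x0_is_1 = [y for x, y in sample if x[0] == 1]
estimate = sum(labels_when_x0_is_1) / len(labels_when_x0_is_1)
```

The learner sees only the 0/1 labels, never the value c(x) itself; here c can still be estimated at a point by averaging labels, but only because the toy instance space is tiny.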

Specifically,  we  consider  the  problem  of  learning  a  class  of  real-valued  read-once  formulas, 
called  read-once  real  formulas.  Formulas  in  this  class  are  constructed  using  two  kinds  of  gates, 
or operators: The first gate, denoted MUL, simply multiplies its two real-valued inputs. The
second gate, $\mathrm{LIN}_{z,w}$, computes the function $\mathrm{LIN}_{z,w}(y) = z + wy$. Here, z and w may be any real
numbers for which z and z + w are both in the range [0, 1]. Clearly, for a Boolean assignment
to the input variables, a formula constructed from such gates outputs a real number between
0 and 1, and so these are indeed p-concepts. We show that this class can be learned with a
model  of  probability  against  any  product  distribution  (such  as  the  uniform  distribution). 

Note that, for Boolean-valued inputs, the function MUL simply computes the logical AND
of its inputs, and $\mathrm{LIN}_{1,-1}$ computes the logical negation of its input. Thus, the class of
read-once real formulas includes the class of read-once Boolean formulas with basis {AND, OR, NOT}.
Therefore, our result demonstrates for the first time the existence of a polynomial-time
algorithm for inferring a good approximation of any such Boolean formula (against a product
distribution).

Also, a gate $\mathrm{LIN}_{z,w}$ can alternatively be viewed as describing the behavior of a noisy or
random Boolean gate which, on input 0 randomly outputs 1 with probability z, and on input 1
outputs 1 with probability z + w. (If the input to such a randomized gate is 1 with probability p,
then the output is easily computed to be 1 with probability $\mathrm{LIN}_{z,w}(p) = z + wp$.) Thus, for the
distributions  considered,  our  result  can  be  regarded  as  a  demonstration  of  the  learnability  of 
read-once  Boolean  formulas,  even  when  every  gate  and  every  wire  of  the  formula  is  corrupted 
by  significant  amounts  of  randomness. 
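
The two gate types, and the way they subsume the Boolean basis, can be sketched as follows. The helper names are illustrative; OR is obtained from MUL and $\mathrm{LIN}_{1,-1}$ by De Morgan’s law, which keeps the formula read-once because each input is still used exactly once.

```python
def mul(a, b):
    """MUL gate: the product of its two real-valued inputs."""
    return a * b

def lin(z, w):
    """LIN_{z,w} gate: y -> z + w*y, with z and z + w both required to lie in [0, 1]."""
    if not (0.0 <= z <= 1.0 and 0.0 <= z + w <= 1.0):
        raise ValueError("z and z + w must both lie in [0, 1]")
    return lambda y: z + w * y

NOT = lin(1, -1)                      # 1 - y: logical negation on {0, 1}

def AND(a, b):
    return mul(a, b)

def OR(a, b):
    return NOT(mul(NOT(a), NOT(b)))   # De Morgan: a OR b = NOT(NOT a AND NOT b)

# A noisy gate in the sense of the text: outputs 1 with probability 0.1 on
# input 0 and with probability 0.1 + 0.8 = 0.9 on input 1.
noisy = lin(0.1, 0.8)
```

For example, `noisy(0.5)` is the probability that the noisy gate outputs 1 when its input is 1 with probability 1/2.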

3-2  Preliminaries 

Given a Boolean function $f : \{0,1\}^n \to \{0,1\}$, Boppana [16, 17] defines its amplification function
$A_f$ as follows: $A_f(p) = \Pr[f(X_1, \ldots, X_n) = 1]$, where $X_1, \ldots, X_n$ are independent Bernoulli
variables that are each 1 with probability p. The quantity $A_f(p)$ is called the amplification
of f at p. Valiant [83] uses properties of the amplification function to prove the existence of
monotone Boolean formulas of size $O(n^{5.3})$ for the majority function on n inputs. Also, we
denote by $D^{(p)}$ the distribution over $\{0,1\}^n$ induced by having each variable independently set
to 1 with probability p.

For $q_j \in \{0,1\}$ and $i_j \in \{1, \ldots, n\}$, $1 \le j \le r$, we write $f|x_{i_1} \leftarrow q_1, \ldots, x_{i_r} \leftarrow q_r$ to denote
the function obtained from f by fixing or hard-wiring each variable $x_{i_j}$ to the value $q_j$. If each
$q_j = q$ for some value q, we abbreviate this by $f|x_{i_1}, \ldots, x_{i_r} \leftarrow q$.

In our framework, the learner is attempting to infer an unknown target concept c chosen
from some known concept class C. In this chapter, $C = \bigcup_{n \ge 1} C_n$ is parameterized by the
number of variables n, and each $c \in C_n$ represents a Boolean function on the domain $\{0,1\}^n$. A
polynomial-time learning algorithm achieves exact identification of a concept class (from some
source of information about the target, such as examples or queries) if it can infer a concept
that is equal to the target concept on all inputs. A polynomial-time learning algorithm achieves
exact identification with high probability if for any $\delta > 0$, it can with probability at least $1 - \delta$
infer a concept that is equal to the target concept on all inputs. In this setting polynomial
time means polynomial in n and $1/\delta$. Our algorithms achieve exact identification with high
probability when the example source is a particular, fixed distribution.

In the distribution-free or probably approximately correct (PAC) learning model, introduced
by Valiant [83] and described in previous chapters, the learner is given access to labeled (positive
and negative) examples of the target concept, drawn randomly according to some unknown
target distribution D. The learner is also given as input $\epsilon, \delta > 0$. The learner’s goal is to output
with probability at least $1 - \delta$ a hypothesis h that has probability at most $\epsilon$ of disagreeing with
c on a randomly drawn example from D (thus, the hypothesis has accuracy at least $1 - \epsilon$). If
such a learning algorithm exists (that is, a polynomial-time algorithm meeting the goal for any
$n \ge 1$, any $c \in C_n$, any distribution D, and any $\epsilon, \delta$), we say that C is PAC-learnable. In this
setting, polynomial time means polynomial in n, $1/\epsilon$ and $1/\delta$. In this chapter, we will primarily
be interested in a variant of Valiant’s model in which the target distribution is known a priori
to belong to a specific restricted class of distributions.

Note  that  because  we  consider  only  read-once  formulas,  there  is  a  unique  path  from  any 
gate  or  variable  to  the  output.  We  define  the  level  or  depth  of  a  gate  A  to  be  the  number  of 
gates  (not  including  A  itself)  on  the  path  from  A  to  the  output.  Thus,  the  output  gate  is  at 
level  0.  Likewise,  we  define  the  level  or  depth  of  an  input  variable  to  be  the  number  of  gates 
on  the  path  from  the  variable  to  the  output.  The  depth  of  the  entire  formula  is  the  maximum 
level  of  any  input,  and  the  bottom  level  consists  of  all  gates  and  variables  of  maximum  depth. 

An input $x_i$, or a gate A, feeds a gate A' if the path from $x_i$ or A to the output goes through
A'. If $x_i$ or A is an input to A', then we say that $x_i$ or A immediately feeds A'. For any two input
bits $x_i$ and $x_j$ we define $\Gamma(x_i, x_j)$ to be the deepest gate A fed by both $x_i$ and $x_j$. Likewise,
$\Gamma(x_i, x_j, x_k)$ is the deepest gate A fed by $x_i$, $x_j$, and $x_k$. We say that a pair of variables $x_i$ and
$x_j$ meet at the gate $\Gamma(x_i, x_j)$. Also, if $\Gamma(x_i, x_j) = \Gamma(x_i, x_k) = \Gamma(x_j, x_k) = \Gamma(x_i, x_j, x_k)$, then we
say that the variables $x_i$, $x_j$ and $x_k$ meet at gate $\Gamma(x_i, x_j, x_k)$; otherwise, the triple does not
meet in the formula. (Note that this only makes sense if there are gates with more than two
inputs, such as a three-input majority gate.)
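
These definitions are easy to make concrete on a small formula. The sketch below assumes a hypothetical representation in which a formula is a nested tuple `(gate_name, child, ...)`, variables are strings, and each gate is identified by its position (the tuple of child indices leading to it from the output gate); none of these conventions come from the text.

```python
def gate_paths(formula):
    """Map each variable to the list of gates on its path, output gate first."""
    paths = {}
    def walk(node, position, gates):
        if isinstance(node, str):          # a leaf: an input variable
            paths[node] = gates
            return
        gates = gates + [position]         # this gate lies on the path
        for k, child in enumerate(node[1:]):
            walk(child, position + (k,), gates)
    walk(formula, (), [])
    return paths

def meet_gate(paths, *variables):
    """The deepest gate fed by all of the given variables."""
    common = None
    for gates in zip(*(paths[v] for v in variables)):
        if all(g == gates[0] for g in gates):
            common = gates[0]
        else:
            break
    return common

def triple_meets(paths, a, b, c):
    """True if all three pairs meet at the same gate as the whole triple."""
    g = meet_gate(paths, a, b, c)
    return all(meet_gate(paths, u, v) == g for u, v in ((a, b), (a, c), (b, c)))

f = ("MAJ", ("MAJ", "x1", "x2", "x3"), "x4", ("MAJ", "x5", "x6", "x7"))
paths = gate_paths(f)
level_of_x1 = len(paths["x1"])             # number of gates on the path to the output
```

In this example x1, x2 and x3 meet at the left inner gate, x1, x4 and x5 meet at the output gate, and the triple x1, x2, x4 does not meet, since x1 and x2 meet deeper than either does with x4.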

3-3  Exact  identification  of  read-once  majority  formulas 

In this section we use properties of amplification functions to obtain a polynomial-time
algorithm that with high probability exactly identifies any read-once majority formula of logarithmic
depth from random examples drawn according to a uniform distribution.

This  type  of  formula  is  used  in  Chapter  2’s  proof  that  a  concept  class  is  weakly  learnable 
in  polynomial  time  if  and  only  if  it  is  strongly  learnable  in  polynomial  time.  That  is,  the 
hypothesis  output  by  the  boosting  procedure  described  in  that  chapter  can  be  viewed  as  a 
majority  formula  whose  inputs  are  the  hypotheses  output  by  the  weak  learning  algorithm.  We 
also  note  that  a  read-once  majority  formula  cannot  in  general  be  converted  into  a  read-once 
Boolean formula over the usual {AND, OR, NOT} basis.

It can be shown that the class of logarithmic-depth read-once majority formulas is not
learnable  in  the  distribution-free  model  (modulo  some  cryptographic  assumptions;  see  Kearns 
and  Valiant  [52]  for  details).  Briefly,  this  can  be  proved  using  a  Pitt  and  Warmuth-style 
“prediction-preserving  reduction”  [70]  to  show  that  learning  read-once  majority  formulas  is  at 
least  as  hard  as  learning  general  Boolean  formulas.  Given  a  Boolean  formula  (of  logarithmic 
depth,  without  loss  of  generality),  the  main  idea  of  such  a  prediction-preserving  reduction 
is to replace each OR gate (respectively, AND gate) with a MAJ gate, one of whose inputs is
wired to a variable that, under the target distribution, always has the value 1 (respectively,
0).  The  resulting  majority  formula  can  further  be  reduced  to  one  that  is  read-once  using  the 
substitution  method  of  Kearns  et  al.  [51].  Finally,  combined  with  Kearns  and  Valiant’s  result 
that  Boolean  formulas  are  not  learnable  (modulo  cryptographic  assumptions),  this  shows  that 
majority  formulas  are  also  unlearnable. 

Despite  the  hardness  of  this  class  in  the  general  distribution-free  framework,  we  show  that 
the class is nevertheless exactly identifiable when examples are chosen from the uniform
distribution. The algorithm consists of two phases. In the first phase, we determine the relevant
variables (i.e., those that occur in the formula), their signs (i.e., whether they are negated or
not), and their levels. To achieve this goal, for each variable, we hard-wire its value to 1 and
estimate the amplification of the induced function at $1/2$ using examples drawn randomly from
the uniform distribution on the remaining variables. Here, by “hard-wiring” a variable to 1,
we really mean that we apply a filter that only lets through examples for which that variable
is 1. We prove that if the variable is relevant, then with high probability this estimate will be
significantly smaller or greater than $1/2$, depending on whether the variable occurs negated or
unnegated in the formula; otherwise, this estimate will be near $1/2$. Furthermore, the level of a
relevant variable can be determined from the amount by which the amplification of the induced
function differs from $1/2$.

In  the  second  phase  of  the  algorithm,  we  construct  the  formula.  More  precisely,  we  first 
construct  the  bottom  level  of  the  formula,  and  then  recursively  construct  the  remaining  levels. 
To  construct  the  bottom  level  of  the  formula,  we  begin  by  finding  triples  of  variables  that  are 
inputs  to  the  same  bottom-level  gate.  To  do  this,  for  each  triple  of  relevant  variables  that 
have  the  largest  level  number,  we  hard-wire  the  three  variables  to  1  and  again  estimate  the 
amplification  of  the  induced  function  from  random  examples.  We  show  that  we  can  determine 
whether  the  three  variables  all  enter  the  same  bottom-level  gate  based  on  this  estimate. 

Briefly, the recursion works as follows. Suppose that we are currently constructing level $t$
of the formula and we find that $x_i$, $x_j$, and $x_k$ are inputs to the same level-$t$ gate. Then in the
recursive call we replace $x_i$, $x_j$, and $x_k$ by a level-$t$ meta-variable $y = \mathrm{MAJ}(x_i, x_j, x_k)$. Since
$y$ is a known subformula, its output on any example can be easily computed and $y$ can be
treated like an ordinary variable. Furthermore, since $1/2$ is the fixed point for the amplification
function of any read-once majority formula, it follows that $y$ is 1 with probability $1/2$. Thus, for
the recursive call we replace all triples of variables that enter level-$t$ gates with meta-variables,
and we easily obtain our needed source of random examples drawn according to the uniform
distribution on the new variable set from the original source of examples.

For the remainder of this section, we explore some of the properties of the amplification
function of read-once majority formulas, leading eventually to a proof of the correctness of this
algorithm.

Figure 1: The amplification function for read-once majority formulas for complete ternary
trees of depth h.

Lemma 3.1 Let $X_1$, $X_2$ and $X_3$ be three independent Bernoulli variables, each 1 with
probability $p_1$, $p_2$ and $p_3$, respectively. Then

\[ \Pr[\mathrm{MAJ}(X_1, X_2, X_3) = 1] = p_1 p_2 + p_1 p_3 + p_2 p_3 - 2 p_1 p_2 p_3 . \]

Proof:  The  stated  probability  is  exactly  the  chance  that  at  least  two  of  the  three  variables 
are  1.  ■ 
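
As a quick sanity check of Lemma 3.1, the closed form can be compared against a direct sum over the eight outcomes. This is only an illustrative sketch; the function names and the test probabilities are assumptions chosen here, and exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import product

def maj3(a, b, c):
    """Three-input majority of bits."""
    return int(a + b + c >= 2)

def maj_prob_enumerated(p1, p2, p3):
    """Pr[MAJ(X1, X2, X3) = 1], summing the weight of all eight outcomes."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=3):
        weight = Fraction(1)
        for bit, p in zip(bits, (p1, p2, p3)):
            weight *= p if bit else 1 - p
        total += weight * maj3(*bits)
    return total

def maj_prob_closed_form(p1, p2, p3):
    """The expression stated in Lemma 3.1."""
    return p1 * p2 + p1 * p3 + p2 * p3 - 2 * p1 * p2 * p3

# Arbitrary illustrative probabilities (not taken from the text).
p = (Fraction(1, 3), Fraction(3, 5), Fraction(7, 8))
agree = maj_prob_enumerated(*p) == maj_prob_closed_form(*p)
half = Fraction(1, 2)
fixed_point = maj_prob_closed_form(half, half, half)
```

Setting all three probabilities to 1/2 returns 1/2, which is the fixed-point property used throughout this section.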

Lemma 3.1 implies that $A_f(1/2) = 1/2$ for any read-once majority formula $f$. Thus, $1/2$ is a fixed
point of $A_f$.

Our approach depends on the fact that the first derivative of $A_f$ is large at $1/2$, meaning that
a slight perturbation of $D^{(1/2)}$ (i.e., the uniform distribution) tends to perturb the statistical
behavior of the formula sufficiently to allow exact identification. See Figure 1 for a graph
showing the amplification function for balanced read-once majority formulas of various depths.

We perturb $D^{(1/2)}$ by hard-wiring a small number of variables to be 1; such perturbations can
always be efficiently sampled by simply waiting for the desired variables to be simultaneously
set to 1 in a random example from $D^{(1/2)}$.

We begin by considering the effect on the function’s amplification of altering the probability
with which one of the variables $x_j$ is set to 1. This will be important in the analysis that follows.

Lemma 3.2 Let $f$ be a read-once majority formula, and let $t$ be the level of an unnegated
variable $x_j$. Then for $q \in \{0, 1\}$,

\[ A_{f|x_j \leftarrow q}(\tfrac{1}{2}) = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t} \left(q - \tfrac{1}{2}\right). \]

Proof: By induction on $t$. When $t = 0$, the formula consists just of the variable $x_j$, and
the lemma holds. For the inductive step, let $f_1$, $f_2$ and $f_3$ be the functions computed by
the three subformulas obtained by deleting the output gate of $f$; thus, $f$ is just the majority
of $f_1$, $f_2$ and $f_3$. Note that $x_j$ occurs in exactly one of these three subformulas; assume
it occurs in the first. Since $x_j$ occurs at level $t - 1$ of this subformula, by the inductive
hypothesis, $A_{f_1|x_j \leftarrow q}(\tfrac{1}{2}) = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t-1} \left(q - \tfrac{1}{2}\right)$, and since $x_j$ does not occur in the other subformulas,
$A_{f_i|x_j \leftarrow q}(\tfrac{1}{2}) = \tfrac{1}{2}$ for $i = 2, 3$. From Lemma 3.1, it follows that $A_{f|x_j \leftarrow q}(\tfrac{1}{2})$ has the stated value,
completing the induction. ■
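
The induction can be replayed numerically: starting from the hard-wired value q and passing through one MAJ gate per level, with the two sibling inputs sitting at the fixed point 1/2, should reproduce the closed form of Lemma 3.2. The sketch below is an illustration under that reading; the function names are assumptions introduced here.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def maj_prob(p1, p2, p3):
    """Lemma 3.1: probability that a MAJ gate outputs 1."""
    return p1 * p2 + p1 * p3 + p2 * p3 - 2 * p1 * p2 * p3

def amplification_with_hardwired(t, q):
    """Amplification at 1/2 with a level-t variable hard-wired to q.
    Each gate on the path has two sibling inputs that do not contain the
    variable, so each sibling outputs 1 with probability exactly 1/2."""
    p = Fraction(q)
    for _ in range(t):
        p = maj_prob(p, HALF, HALF)
    return p

def lemma_3_2(t, q):
    """Closed form stated in Lemma 3.2."""
    return HALF + HALF ** t * (q - HALF)

checks = [amplification_with_hardwired(t, q) == lemma_3_2(t, q)
          for t in range(8) for q in (0, 1)]
```

Each gate halves the deviation from 1/2, which is why a variable at level t shifts the output probability by exactly (1/2)^(t+1) when it is hard-wired to 1.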

It can now be seen how we use the amplification function to determine the relevant variables
of $f$: if $x_j$ is relevant, then the statistical behavior of the output of $f$ changes significantly when
$x_j$ is hard-wired to 1. Similarly, the sign and the level of each variable can be readily determined
in this manner.

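Theorem 3.3 Let $f$ be a read-once majority formula of depth $h$. Let $\hat{a}$ be an estimate of
$a = A_{f|x_j \leftarrow 1}(\tfrac{1}{2})$ for some variable $x_j$, and assume that $|\hat{a} - a| < \tau \le \left(\tfrac{1}{2}\right)^{h+2}$. Then

• $x_j$ is relevant if and only if $|\hat{a} - \tfrac{1}{2}| > \tau$;

• if $x_j$ is relevant, then it occurs negated if and only if $\hat{a} < \tfrac{1}{2}$;

• $x_j$ occurs at level $t$ if and only if $\bigl|\,|\hat{a} - \tfrac{1}{2}| - \left(\tfrac{1}{2}\right)^{t+1}\bigr| < \tau$.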

Proof:  This  proof  follows  from  straightforward  calculations  using  Lemma  3.2.  ■ 
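
The three tests of Theorem 3.3 can be packaged as a single decision procedure. The following is a minimal sketch, assuming the largest tolerance the theorem permits, tau = (1/2)^(h+2); the function name, return convention, and sample estimates are hypothetical.

```python
def classify_variable(a_hat, h):
    """Apply the tests of Theorem 3.3 to an estimate a_hat of the
    amplification at 1/2 with one variable hard-wired to 1.

    Returns None for an irrelevant variable, else (negated, level).
    Assumption of this sketch: tau is taken to be (1/2)^(h+2)."""
    tau = 0.5 ** (h + 2)
    deviation = abs(a_hat - 0.5)
    if deviation <= tau:
        return None                      # not relevant
    negated = a_hat < 0.5                # sign test
    for t in range(h + 1):               # level test
        if abs(deviation - 0.5 ** (t + 1)) < tau:
            return negated, t
    raise ValueError("estimate is inconsistent with every level up to h")

# Hypothetical estimates for a formula of depth h = 3 (tau = 1/32).
unnegated_level_2 = classify_variable(0.635, 3)
negated_level_3 = classify_variable(0.4575, 3)
irrelevant = classify_variable(0.51, 3)
```

Consecutive levels predict deviations that differ by at least (1/2)^(h+1), which is twice the tolerance, so at most one level can pass the test.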

Thus, if one estimates the value of the amplification function from a sample whose size is
polynomial in $2^h$, then with high probability one can determine which variables are relevant, as
well as the sign and level of every relevant variable. Specifically, we can apply Chernoff bounds
(Lemma 2-3.6) to derive a sample size sufficient to ensure that all the above information is
properly computed with high probability. We therefore assume henceforth that the level of
every variable has been determined, and that (without loss of generality) all variables are
relevant and unnegated.

More problematic is determining exactly how the variables are combined in $f$. A natural
approach  is  to  try  hard-wiring  pairs  of  variables  to  1,  and  to  again  estimate  the  amplification 
of  the  induced  function  in  the  hopes  that  some  structural  information  will  be  revealed.  The 
following lemma, which is useful at a later point, shows that this approach fails.

Lemma 3.4 Let $f$ be a read-once majority formula, and let $x_i$ and $x_j$ be distinct, unnegated
variables which occur at levels $t_1$ and $t_2$, respectively. Then

\[ A_{f|x_i,x_j \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t_1+1} + \left(\tfrac{1}{2}\right)^{t_2+1}, \]

regardless of the depth $d$ of $A = \Gamma(x_i, x_j)$.

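Proof: By induction on $d$. Let $f_1$, $f_2$ and $f_3$ be the three subformulas of $f$ which are inputs
to the output gate so that $f = \mathrm{MAJ}(f_1, f_2, f_3)$.

If $d = 0$, then $A$ is the output gate, and $x_i$ and $x_j$ occur in two of the subformulas (say, $f_1$
and $f_2$, respectively). From Lemma 3.2, it follows that

\[ A_{f_k|x_i,x_j \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t_k} \]

for $k = 1, 2$, and, since neither $x_i$ nor $x_j$ is relevant to $f_3$, $A_{f_3|x_i,x_j \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2}$. The stated value
for $A_{f|x_i,x_j \leftarrow 1}(\tfrac{1}{2})$ follows then from Lemma 3.1.

If $d > 0$, then $A$ is a gate occurring in one of the subformulas (say $f_1$) at level $d - 1$ of the
subformula. By inductive hypothesis,

\[ A_{f_1|x_i,x_j \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t_1} + \left(\tfrac{1}{2}\right)^{t_2}. \]

Also, $A_{f_k|x_i,x_j \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2}$ for $k = 2, 3$. The stated value for $A_{f|x_i,x_j \leftarrow 1}(\tfrac{1}{2})$ again follows from
Lemma 3.1. ■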

Thus,  if  two  relevant  variables  are  hard-wired  to  1,  no  information  is  obtained  by  knowing 
the  value  of  the  amplification  function.  That  is,  the  amplification  function  is  independent  of 
the  level  at  which  the  two  variables  meet. 

Therefore,  we  instead  consider  what  happens  when  three  relevant  variables  of  the  same  level 
are  fixed  to  1.  In  fact,  it  turns  out  to  be  sufficient  to  do  so  for  triples  of  variables  all  of  which 
occur  at  the  bottom  level  of  the  formula.  We  show  that  by  doing  so  one  can  determine  the  full 
structure  of  the  formula. 

For each triple $x_i$, $x_j$ and $x_k$ all occurring at level $t$, there are essentially two cases to
consider; either

1. the triple $x_i$, $x_j$, $x_k$ does not meet in the formula; or

2. the three variables $x_i$, $x_j$ and $x_k$ meet at the gate $\Gamma(x_i, x_j, x_k)$. We divide this case into
two sub-cases:

(a) $x_i$, $x_j$ and $x_k$ are inputs to the same gate so that $\Gamma(x_i, x_j, x_k)$ occurs at level $t - 1$;
or

(b) $\Gamma(x_i, x_j, x_k)$ occurs at some level $d < t - 1$.

We are interested in separating Case 2a from the other cases by estimating the amplification
of the function when all three variables are hard-wired to 1. This is sufficient to reconstruct
the structure of the formula: if we can find three variables that are inputs to some gate $A$
(and there always must exist such a triple), then we can essentially replace the subformula
consisting of the three variables and the gate $A$ by a new meta-variable whose value can easily
be determined from the values of the original three variables. Furthermore, since $1/2$ is a fixed
point for all read-once majority formulas, the meta-variables’ statistics will be the same as those
of  the  original  variables.  Thus,  the  total  number  of  variables  is  reduced  by  two,  and  the  rest  of 
the  formula’s  structure  can  be  determined  recursively. 

The  following  two  lemmas  analyze  the  amplification  of  the  function  when  three  variables 
are  hard-wired  to  1  in  both  of  the  above  cases.  We  begin  with  Case  2: 

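Lemma 3.5 Let $f$ be a read-once majority formula. Let $x_i$, $x_j$ and $x_k$ be three distinct,
unnegated inputs which occur at levels $t_1$, $t_2$ and $t_3$, respectively, and which meet at gate
$A = \Gamma(x_i, x_j, x_k)$. Let $d$ be the level of $A$. Then

\[ A_{f|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t_1+1} + \left(\tfrac{1}{2}\right)^{t_2+1} + \left(\tfrac{1}{2}\right)^{t_3+1} - \left(\tfrac{1}{2}\right)^{t_1+t_2+t_3-2d-1}. \]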


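Proof: By induction on $d$. As in the preceding lemmas, suppose that $f = \mathrm{MAJ}(f_1, f_2, f_3)$. If
$d = 0$, then $A$ is the output gate of $f$, and, without loss of generality, $x_i$, $x_j$ and $x_k$ occur one
each in $f_1$, $f_2$ and $f_3$, respectively. From Lemma 3.2, $A_{f_r|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t_r}$ for $r = 1, 2, 3$.
The stated value for $A_{f|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2})$ follows from Lemma 3.1.

If $d > 0$, then one of the subformulas (say, $f_1$) contains $A$ at depth $d - 1$. By inductive
hypothesis,

\[ A_{f_1|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t_1} + \left(\tfrac{1}{2}\right)^{t_2} + \left(\tfrac{1}{2}\right)^{t_3} - \left(\tfrac{1}{2}\right)^{t_1+t_2+t_3-2d-2}, \]

and of course, $A_{f_r|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2}$ for $r = 2, 3$. The proof is completed by again applying
Lemma 3.1. ■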
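
Both steps of this induction rest on one algebraic identity, which may be worth writing out. As a worked sketch (the symbols $\epsilon_r$ are introduced here for illustration and do not appear in the original), write each input probability of the output gate as $p_r = \tfrac{1}{2} + \epsilon_r$ and expand the expression of Lemma 3.1:

```latex
\begin{align*}
p_1 p_2 + p_1 p_3 + p_2 p_3
  &= \tfrac{3}{4} + (\epsilon_1 + \epsilon_2 + \epsilon_3)
     + (\epsilon_1 \epsilon_2 + \epsilon_1 \epsilon_3 + \epsilon_2 \epsilon_3), \\
2\, p_1 p_2 p_3
  &= \tfrac{1}{4} + \tfrac{1}{2}(\epsilon_1 + \epsilon_2 + \epsilon_3)
     + (\epsilon_1 \epsilon_2 + \epsilon_1 \epsilon_3 + \epsilon_2 \epsilon_3)
     + 2\, \epsilon_1 \epsilon_2 \epsilon_3, \\
\Pr[\mathrm{MAJ} = 1]
  &= \tfrac{1}{2} + \tfrac{1}{2}(\epsilon_1 + \epsilon_2 + \epsilon_3)
     - 2\, \epsilon_1 \epsilon_2 \epsilon_3 .
\end{align*}
```

Taking $\epsilon_r = (\tfrac{1}{2})^{t_r}$ gives the case $d = 0$, since $2 (\tfrac{1}{2})^{t_1+t_2+t_3} = (\tfrac{1}{2})^{t_1+t_2+t_3-1}$. Taking $\epsilon_2 = \epsilon_3 = 0$ shows that each further gate simply halves the deviation from $\tfrac{1}{2}$, which turns the exponent $t_1+t_2+t_3-2d-2$ of the inductive hypothesis into $t_1+t_2+t_3-2d-1$.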

So,  unlike  the  situation  in  which  only  two  variables  are  hard-wired  to  1,  here  the  value  of  the 
amplification  function  depends  on  the  level  of  the  formula  at  which  the  three  variables  meet. 
However, it may be the case that $x_i$, $x_j$, and $x_k$ do not meet at all (i.e., we may be in Case 1).
The  next  lemma  considers  this  case. 


Lemma 3.6 Let $f$ be a read-once majority formula. Let $x_i$, $x_j$ and $x_k$ be three distinct,
unnegated inputs that occur at levels $t_1$, $t_2$ and $t_3$, respectively, and for which $A' = \Gamma(x_i, x_j) \ne
\Gamma(x_i, x_j, x_k) = A$. Then

\[ A_{f|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t_1+1} + \left(\tfrac{1}{2}\right)^{t_2+1} + \left(\tfrac{1}{2}\right)^{t_3+1}, \]

regardless of the levels $d$ and $d'$ of gates $A$ and $A'$.

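Proof: By induction on $d$. As before, assume that $f = \mathrm{MAJ}(f_1, f_2, f_3)$. If $d = 0$, then $A$
is the output gate, $A'$ occurs (say) in $f_1$, and $x_k$ in $f_2$. From Lemma 3.4, $A_{f_1|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2}) =
\tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t_1} + \left(\tfrac{1}{2}\right)^{t_2}$, and from Lemma 3.2, $A_{f_2|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t_3}$. Also, $A_{f_3|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2}$.
Lemma 3.1 then implies the stated value for $A_{f|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2})$.

If $d > 0$, then $A$ occurs, say, in $f_1$. By inductive hypothesis, $A_{f_1|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t_1} +
\left(\tfrac{1}{2}\right)^{t_2} + \left(\tfrac{1}{2}\right)^{t_3}$, and clearly $A_{f_r|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2}) = \tfrac{1}{2}$ for $r = 2, 3$. An application of Lemma 3.1
completes the induction. ■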

Combining  these  lemmas,  we  can  show  that  Case  2a  can  be  separated  from  the  other  cases 
by estimating the function’s amplification with triples of variables hard-wired to 1.

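Theorem 3.7 Let $f$ be a read-once majority formula. Let $x_i$, $x_j$ and $x_k$ be three distinct,
unnegated, level-$t$ inputs. Let $\hat{a}$ be an estimate of $a = A_{f|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2})$ for which $|\hat{a} - a| < \tau \le
3 \left(\tfrac{1}{2}\right)^{t+4}$. Then $x_i$, $x_j$ and $x_k$ are inputs to the same gate of $f$ if and only if $\hat{a} < \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t} + \tau$.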

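Proof: If Case 2a applies, then Lemma 3.5 implies that $a = \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t}$. Otherwise, if either
Case 1 or 2b applies, then Lemmas 3.5 and 3.6 imply that $a \ge \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^{t} + 3 \left(\tfrac{1}{2}\right)^{t+3}$. The theorem
follows immediately. ■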
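
The gap used in this proof can be checked by evaluating the closed forms of Lemmas 3.5 and 3.6 for three level-t variables. The sketch below is illustrative; the function names and the convention `d=None` for a triple that does not meet are assumptions made here.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def triple_amplification(t, d=None):
    """Exact amplification at 1/2 with three level-t variables hard-wired to 1.
    d is the level of the gate where the triple meets (Lemma 3.5);
    d=None means the triple does not meet (Lemma 3.6)."""
    a = HALF + 3 * HALF ** (t + 1)
    if d is not None:
        a -= HALF ** (3 * t - 2 * d - 1)
    return a

def same_gate(a_hat, t, tau):
    """Decision rule of Theorem 3.7."""
    return a_hat < HALF + HALF ** t + tau

t = 4
tau = 3 * HALF ** (t + 4)                  # largest tolerance allowed
slack = Fraction(9, 10) * tau              # an error strictly below tau
case_2a = same_gate(triple_amplification(t, t - 1) + slack, t, tau)
case_2b = same_gate(triple_amplification(t, t - 2) - slack, t, tau)
case_1 = same_gate(triple_amplification(t, None) - slack, t, tau)
```

Here `case_2a` is accepted even with the estimate pushed upward, while `case_2b` and `case_1` are rejected even with the estimate pushed downward, because the true values are separated by at least 3(1/2)^(t+3), twice the tolerance.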

We  are  now  ready  to  state  the  main  result  of  this  section: 

Theorem 3.8 There exists an algorithm with the following properties: Given $h$, $n$, $\delta > 0$, and
access to examples drawn from the uniform distribution on $\{0, 1\}^n$ and labeled by any read-once
majority formula $f$ of depth at most $h$ on $n$ variables, the algorithm exactly identifies $f$ with
probability at least $1 - \delta$. The algorithm’s sample complexity is $O(4^h \cdot \log(n/\delta))$, and its time
complexity is $O(4^h \cdot (r^3 + n) \cdot \log(n/\delta))$, where $r$ is the number of relevant variables appearing
in the target formula.

Proof: First, for each variable $x_j$, estimate the function’s amplification with $x_j$ hard-wired
to 1. (We will ensure that, with high probability, this estimate is within $\left(\tfrac{1}{2}\right)^{h+2}$ of the true
amplification.) It follows from Theorem 3.3 that after this phase of the algorithm, with high
probability we know which variables are relevant, and the sign and depth of each relevant
variable. (So, we assume from now on that the formula is monotone.)

In the second phase of the algorithm, we build the formula level by level from bottom to
top. To build the bottom level, for all triples of variables $x_i$, $x_j$, $x_k$ that enter the bottom
level, we estimate the amplification with $x_i$, $x_j$, and $x_k$ hard-wired to 1. (We will ensure that,
with high probability, this estimate is within $3 \left(\tfrac{1}{2}\right)^{h+4}$ of the true amplification.) It follows from
Theorem 3.7 that we can determine which variables enter the same bottom-level gates.

LearnMajorityFormula(n, h, δ)
    E ← O(4^h log(n/δ)) labeled examples from D^(1/2)
    X ← ∅
    for 1 ≤ i ≤ n
        E' ← examples from E for which x_i = 1
        â ← fraction of E' that are positive
        if |â − 1/2| > (1/2)^(h+2) then
            if â > 1/2 then X ← X ∪ {x_i}
            else X ← X ∪ {¬x_i}
            t(x_i) ← compute-level(â, h)
    BuildFormula(h, X, E)

BuildFormula(t, X, E)
    if t = 0 then target formula is only variable in X
    else
        for all x_i, x_j, x_k ∈ X for which t(x_i) = t(x_j) = t(x_k) = t
            E' ← examples from E for which x_i = x_j = x_k = 1
            â ← fraction of E' that are positive
            if â < 1/2 + (1/2)^t + 3 (1/2)^(t+4) then
                let y = MAJ(x_i, x_j, x_k) be a new variable
                t(y) ← t − 1
                X ← (X ∪ {y}) − {x_i, x_j, x_k}
        BuildFormula(t − 1, X, E)

Figure 2: Algorithm for exactly identifying read-once majority formulas of depth h. Procedure
compute-level(â, h) computes the level associated with â as given by Theorem 3.3.
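
For readers who want to experiment with Figure 2, here is a hedged Python sketch of the two procedures. It is not the thesis's implementation: the encoding of formulas as nested tuples, the function names, and the example target are assumptions, and the sample is replaced by an exhaustive enumeration of {0,1}^n so that every estimate equals the true amplification. The helper that picks the nearest level is a simplification of compute-level that agrees with Theorem 3.3 when the estimate has the required accuracy.

```python
from itertools import combinations, product

def maj(a, b, c):
    return int(a + b + c >= 2)

def evaluate(formula, x):
    """Encoding assumed by this sketch: an int is a variable index,
    ("not", i) is a negated variable, a 3-tuple of formulas is a MAJ gate."""
    if isinstance(formula, int):
        return x[formula]
    if formula[0] == "not":
        return 1 - x[formula[1]]
    return maj(*(evaluate(g, x) for g in formula))

def positive_fraction(examples, selected):
    """Filter: keep examples on which every selected (meta-)variable is 1,
    then return the fraction of kept examples that are positive."""
    kept = [label for x, label in examples
            if all(evaluate(g, x) for g in selected)]
    return sum(kept) / len(kept)

def learn_majority_formula(n, h, examples):
    X = {}  # maps each (meta-)variable, stored as a formula, to its level
    for i in range(n):
        a_hat = positive_fraction(examples, [i])
        dev = abs(a_hat - 0.5)
        if dev > 0.5 ** (h + 2):
            literal = i if a_hat > 0.5 else ("not", i)
            # Simplified compute-level: nearest predicted deviation (1/2)^(t+1).
            X[literal] = min(range(h + 1),
                             key=lambda t: abs(dev - 0.5 ** (t + 1)))
    return build_formula(h, X, examples)

def build_formula(t, X, examples):
    if t == 0:
        (formula,) = X  # exactly one (meta-)variable should remain
        return formula
    candidates = [g for g, lvl in X.items() if lvl == t]
    for triple in combinations(candidates, 3):
        if not all(g in X for g in triple):
            continue  # one member was already merged into a meta-variable
        a_hat = positive_fraction(examples, triple)
        if a_hat < 0.5 + 0.5 ** t + 3 * 0.5 ** (t + 4):
            for g in triple:
                del X[g]
            X[triple] = t - 1  # new meta-variable y = MAJ(triple)
    return build_formula(t - 1, X, examples)

# Hypothetical target on 8 variables; x7 is irrelevant and x1 is negated.
target = ((0, ("not", 1), 2), 3, (4, 5, 6))
examples = [(x, evaluate(target, x)) for x in product((0, 1), repeat=8)]
learned = learn_majority_formula(8, 2, examples)
```

The learned tuple may list a gate's inputs in a different order than `target`, so equivalence is best checked by comparing outputs on every assignment rather than by comparing the tuples directly. With a genuinely random sample, the same code applies, but the estimates are then only approximately correct and the sample must be large enough for the accuracy required by Theorems 3.3 and 3.7.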

We want to recurse to compute the other levels; however, we cannot hard-wire too many
variables without the filter requiring too many examples. The key observation is that on
examples drawn from the uniform distribution, the output of any subformula is 1 with probability $1/2$.
Thus, the inputs into any level are in fact distributed according to a uniform distribution. Since
we compute the formula from bottom to top, the filter can just compute the value for the known
levels to determine the inputs to the level currently being learned. Our algorithm is described
in Figure 2.

Given that the estimates for the amplification function have the needed accuracy, the proof
of correctness follows from Theorems 3.3 and 3.7. Specifically, for each variable $x_i$, we need
a good estimate $\hat{a}$ of $a = A_{f|x_i \leftarrow 1}(\tfrac{1}{2})$; we require that the chosen sample be sufficiently large
that $|a - \hat{a}| < 2^{-(h+2)}$ with probability at least $1 - \delta/(2n)$. Then every such estimate for the $n$
variables will have the needed accuracy with probability at least $1 - \delta/2$. These estimates thus
satisfy the requirements of the first phase of the algorithm.

In the second phase, we require good estimates of the formula’s amplification when triples of
variables are hard-wired. In fact, we need such estimates not only when ordinary variables are
hard-wired, but also when we hard-wire meta-variables. Note that, assuming all estimates have
the needed accuracy, every (meta-)variable added to the set $X$ in Figure 2 in fact computes
some subformula $g$ of $f$. Thus, for every triple of subformulas $g_1$, $g_2$ and $g_3$ of $f$, our algorithm
requires an estimate $\hat{a}$ of $a$, the amplification of $f$ at $1/2$, given that the output of each subformula
$g_1$, $g_2$ and $g_3$ is fixed to the value 1. Since a read-once majority formula on $n$ variables has at
most $3n/2$ subformulas (since it has at most $n/2$ MAJ gates), we require a sample sufficiently
large that $|a - \hat{a}| < 3 \left(\tfrac{1}{2}\right)^{h+4}$ with probability at least $1 - 4\delta/(3n)^3$. The chance that all of the
(at most $(3n/2)^3$) estimates have the needed accuracy is then at least $1 - \delta/2$.

Thus, a sufficiently large sample provides all of the needed estimates with probability at
least $1 - \delta$. The sample size required can be derived using a standard application of Chernoff
bounds (Lemma 2-3.6). ■
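
To make the order of growth concrete, the following sketch computes a sufficient number of filtered examples for each phase. It uses Hoeffding's inequality, Pr[|â − a| ≥ τ] ≤ 2 exp(−2mτ²), as a stand-in for Lemma 2-3.6, whose exact form is not reproduced in this section; the function names and the sample parameter values are likewise assumptions.

```python
import math

def samples_for_estimate(tau, failure_prob):
    """Smallest m with 2 * exp(-2 * m * tau**2) <= failure_prob
    (Hoeffding's inequality; an assumed substitute for Lemma 2-3.6)."""
    return math.ceil(math.log(2 / failure_prob) / (2 * tau ** 2))

def filtered_samples_phase_one(n, h, delta):
    """Examples needed per single-variable estimate (accuracy 2^-(h+2),
    failure probability delta / (2n) as in the proof)."""
    return samples_for_estimate(0.5 ** (h + 2), delta / (2 * n))

def filtered_samples_phase_two(n, h, delta):
    """Examples needed per triple estimate (accuracy 3 * 2^-(h+4),
    failure probability 4 * delta / (3n)^3 as in the proof)."""
    return samples_for_estimate(3 * 0.5 ** (h + 4), 4 * delta / (3 * n) ** 3)

# Illustrative parameters only.
m4 = filtered_samples_phase_one(100, 4, 0.1)
m5 = filtered_samples_phase_one(100, 5, 0.1)
growth = m5 / m4
```

These counts refer to examples that pass the filter. Since a uniform example passes a one-variable filter with probability 1/2 and a three-variable filter with probability 1/8, the unfiltered sample is larger only by a constant factor, and each extra level multiplies the requirement by about four, in line with the 4^h term of Theorem 3.8.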

Note that our algorithm’s sample complexity has only a logarithmic dependence on the
number of irrelevant attributes. Also, it follows immediately from Theorem 3.8 that any
read-once majority formula of depth $O(\log n)$ can be exactly identified in polynomial time.

Finally,  we  note  that  our  algorithm  can  be  modified  to  work  without  receiving  a  bound  for 
the  height  of  the  formula  as  input;  the  time  and  sample  complexity  only  increase  by  a  factor 
of  two.  The  idea  is  to  guess  an  initial  value  of  h  =  1  and  to  increment  our  guess  each  time 
the  algorithm  fails;  it  can  be  shown  that,  if  the  formula’s  height  is  greater  than  our  current 
guess,  then  this  fact  will  become  evident  by  our  algorithm’s  inability  to  successfully  construct  a 
formula.  (Specifically,  the  algorithm  BuildFormula  in  Figure  2  will  reach  a  point  at  which  there 
remain level-$t$ variables in $X$, but no three remaining level-$t$ variables are immediate inputs to
the  same  gate.) 

3-4  Exact identification of read-once positive NAND formulas

In this section we use the properties of the amplification function to obtain a polynomial-time
algorithm that with high probability exactly identifies any read-once positive NAND formula of
logarithmic depth from $D^{(\psi)}$, where $\psi$ is the constant $(\sqrt{5} - 1)/2 \approx 0.618$. Note that $\psi = 1/\phi =
\phi - 1$, where $\phi$ is the golden ratio.

The class of read-once positive NAND formulas is equivalent to the class of read-once formulas
constructed from alternating levels of OR/AND gates, starting with an OR gate at top level, and
with the additional condition that each variable is negated if and only if it enters an OR gate.
This observation is easily proved by repeated application of De Morgan’s law.

It is interesting to compare our result with what is known about learning this class of
formulas in other models. It follows from the results of Kearns and Valiant [52] and Pitt and
Warmuth [70] that learning this class of formulas is hard in the distribution-free model (under
cryptographic assumptions). Thus, there exist distributions that reveal essentially no
information about the formula that is useful for prediction to a computationally bounded algorithm.
If one views the sampling of the distribution $D^{(\psi)}$ as a form of non-adaptive “random
membership queries,” our result can also be compared with the algorithm of Angluin, Hellerstein
and Karpinski [8], which uses membership and equivalence queries that are considerably more
complicated and are highly dependent on the target concept; on the other hand, their algorithm
can be used to identify a broader class of formulas.

We show that this class of formulas is learnable when examples are chosen from a distribution
in which each variable is 1 with probability $\psi$. The basic structure of the algorithm is just like
that of the preceding algorithm for identifying read-once majority formulas. In the first phase
of the algorithm, we determine the relevant variables and their depths by hard-wiring each
variable to 0, and estimating the amplification of the induced function at $\psi$ using random
examples from $D^{(\psi)}$. In the second phase of the algorithm, we construct the formula by finding
pairs  of  variables  that  are  direct  inputs  to  a  bottom-level  gate.  Here,  we  show  that  this  is 
possible  by  hard-wiring  pairs  of  variables  to  0  and  estimating  the  function’s  amplification. 
After  learning  the  structure  of  the  bottom  level  of  the  formula,  we  again  are  able  to  construct 
the  remaining  levels  recursively. 

Since  the  techniques  used  in  this  section  are  so  similar  to  those  in  Section  3-3,  the  proofs 
of  the  lemmas  and  theorems  have  been  omitted.  Most  of  the  lemmas  can  be  proved  by  simple 
induction  arguments  as  before. 

We turn now to a discussion of some of the properties of the amplification function of
read-once positive NAND formulas; these lead to a proof of the correctness of our algorithm.

Lemma 4.1 Let $X_1$ and $X_2$ be independent Bernoulli variables, each 1 with probability $p_1$ and
$p_2$, respectively. Then $\Pr[\mathrm{NAND}(X_1, X_2) = 1] = 1 - p_1 p_2$.

It is easily verified that $1 - \psi^2 = \psi$, and thus that $\psi$ is a fixed point of the amplification
function $A_f$ whenever $f$ is a read-once positive NAND formula. Once again, our approach
depends on the fact that slight perturbations of $D^{(\psi)}$ tend to perturb the statistical behavior
of the formula sufficiently to allow exact identification.

Lemma 4.2 Let $f$ be a read-once positive NAND formula, and let $t$ be the level of some
variable $x_j$. Then $A_{f|x_j \leftarrow q}(\psi) = \psi + (q - \psi)(-\psi)^t$.
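
Since the proofs in this section are omitted, a numerical check of Lemma 4.2 may be helpful. The sketch below climbs from a hard-wired variable to the output through t NAND gates, each of whose other input sits at the fixed point ψ, and compares the result with the stated closed form. The function names are assumptions introduced here, and floating-point arithmetic is used, so the comparison is up to rounding.

```python
import math

PSI = (math.sqrt(5) - 1) / 2  # the constant psi, roughly 0.618

def nand_prob(p1, p2):
    """Lemma 4.1: probability that a NAND gate outputs 1."""
    return 1 - p1 * p2

def amplification_with_hardwired(t, q):
    """Amplification at psi with a level-t variable hard-wired to q.
    The sibling input of every gate on the path does not contain the
    variable, so it is 1 with probability psi (the fixed point)."""
    p = float(q)
    for _ in range(t):
        p = nand_prob(p, PSI)
    return p

def lemma_4_2(t, q):
    """Closed form stated in Lemma 4.2."""
    return PSI + (q - PSI) * (-PSI) ** t

max_error = max(abs(amplification_with_hardwired(t, q) - lemma_4_2(t, q))
                for t in range(10) for q in (0, 1))
```

Each NAND gate multiplies the deviation from ψ by −ψ, which is the source of the alternating sign in the lemma.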

Thus, hard-wiring an even-leveled input to 0 decreases the amplification while hard-wiring
an odd-leveled input to 0 increases the amplification. To give some intuition explaining this
behavior, consider the correspondence described above between read-once positive NAND
formulas and leveled OR/AND formulas. An even-leveled input corresponds to an input to an AND
gate and thus hard-wiring that input to 0 clearly decreases the amplification. However, an
odd-leveled input corresponds to an input that is first negated and then fed to an OR gate;
thus, this case corresponds to hard-wiring the input to an OR gate to 1 which clearly increases
the amplification function.

As we saw in the last section, the amplification function can be used to determine the
relevant variables of $f$: if $x_j$ is relevant then the statistical behavior of the output of $f$ changes
significantly when $x_j$ is hard-wired to 0. Similarly, the level of each variable can be computed
in this manner.

Theorem 4.3 Let $f$ be a read-once positive NAND formula of depth $h$. Let $\hat{a}$ be an estimate of
$a = A_{f|x_j \leftarrow 0}(\psi)$ for some variable $x_j$, and assume that $|\hat{a} - a| < \tau \le \psi^{h+1}/2$. Then

• $x_j$ is relevant if and only if $|\hat{a} - \psi| > \tau$;

• $x_j$ occurs at level $t$ if and only if $|\psi + (-\psi)^{t+1} - \hat{a}| < \tau$.

We next consider the effect on the amplification function of hard-wiring two inputs. Unlike
the case of majority formulas, measuring the amplification of the function when pairs of variables
are hard-wired to 0 reveals a great deal of information about the structure of the formula. In
particular, the value of the amplification function when two level-$t$ variables $x_i$ and $x_j$ are
hard-wired to 0 depends critically on the depth of $\Gamma(x_i, x_j)$.

Lemma 4.4 Let $f$ be a read-once positive NAND formula, and let $x_i$ and $x_j$ be two distinct
variables which occur at levels $t_1$ and $t_2$, respectively, and for which $A = \Gamma(x_i, x_j)$ is at level $d$.
Then

\[ A_{f|x_i,x_j \leftarrow 0}(\psi) = \psi + (-\psi)^{t_1+1} + (-\psi)^{t_2+1} - (-\psi)^{t_1+t_2-d}. \]

Using  the  same  ideas  as  in  the  last  section,  it  can  now  be  proved  that,  given  a  good  estimate 
of  the  amplification  function,  one  can  determine  which  variables  meet  at  bottom-level  gates. 

Theorem 4.5 Let $f$ be a read-once positive NAND formula. Let $x_i$ and $x_j$ be two level-$t$ inputs.
Let $\hat{a}$ be an estimate of $a = A_{f|x_i,x_j \leftarrow 0}(\psi)$ for which $|\hat{a} - a| < \tau \le \psi^{t+3}/2$. Then $x_i$ and $x_j$ are
inputs to the same level-$t$ gate of $f$ if and only if $|\psi + (-\psi)^{t+1} - \hat{a}| < \tau$.
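
The pair test can be explored numerically. The sketch below computes the exact amplification for two level-t variables hard-wired to 0 by propagating probabilities with Lemma 4.1: up each branch to the gate where the pair meets, through that gate, and on to the output. The function names are assumptions, and the closed form ψ + 2(−ψ)^(t+1) − (−ψ)^(2t−d) quoted in the usage note is derived here from Lemmas 4.1 and 4.2 rather than taken from a proof in the text.

```python
import math

PSI = (math.sqrt(5) - 1) / 2

def pair_amplification(t, d):
    """Amplification at psi when two level-t variables that meet at a
    level-d gate are both hard-wired to 0 (requires 0 <= d <= t - 1)."""
    p = 0.0
    for _ in range(t - d - 1):      # gates strictly between variable and meeting gate
        p = 1 - p * PSI
    p = 1 - p * p                   # meeting gate: both inputs are perturbed
    for _ in range(d):              # gates above the meeting gate
        p = 1 - p * PSI
    return p

def same_gate(a_hat, t, tau):
    """Decision rule of Theorem 4.5."""
    return abs(PSI + (-PSI) ** (t + 1) - a_hat) < tau

t = 5
tau = PSI ** (t + 3) / 2
accepted = same_gate(pair_amplification(t, t - 1) + 0.9 * tau, t, tau)
rejected = [same_gate(pair_amplification(t, d), t, tau) for d in range(t - 1)]
```

When the two variables enter the same gate (d = t − 1) the value is ψ + (−ψ)^(t+1); for every shallower meeting level the value differs from this by at least ψ^(t+2), which is more than twice the tolerance of the theorem.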

We  are  now  ready  to  state  the  main  result  of  this  section: 

Theorem 4.6 Let $\psi = 1/\phi = (\sqrt{5} - 1)/2$. Then there exists an algorithm with the following
properties: Given $h$, $n$, $\delta > 0$, and access to examples drawn from the distribution $D^{(\psi)}$ on
$\{0, 1\}^n$ and labeled by any read-once positive NAND formula $f$ of depth at most $h$ on $n$
variables, the algorithm exactly identifies $f$ with probability at least $1 - \delta$. The algorithm’s sample
complexity is $O(\phi^{2h} \cdot \log(n/\delta))$, and its time complexity is $O(\phi^{2h} \cdot (r^2 + n) \cdot \log(n/\delta))$, where $r$
is the number of relevant variables appearing in the target formula.

Our algorithm is obtained by making the obvious modifications to LearnMajorityFormula
and BuildFormula. The proof that this algorithm is correct follows from the preceding lemmas
and theorems, and is similar to the proof of Theorem 3.8.

As before, it follows immediately that any read-once positive NAND formula of depth at
most $O(\log n)$ can be exactly identified in polynomial time.

3-5  Handling  random  misclassification  noise 

Because the algorithms described in the preceding sections are statistical in nature, they are
easily modified to handle a considerable amount of noise. In this section, we describe a robust
version of our algorithm for learning logarithmic-depth read-once majority formulas. Although
omitted, a similar (though slightly more involved) algorithm can be derived for NAND formulas.

Our algorithm is able to handle a kind of random misclassification noise which is similar to, but
slightly more general than, that considered by Angluin and Laird [9], and Sloan [81]. Specifically,
the output of the target formula is “flipped” with some fixed probability which may depend on
the formula’s output. Thus, if the true, computed output of the formula is 0, then the learner
sees 0 with probability $1 - \eta_0$, and 1 with probability $\eta_0$, for some quantity $\eta_0$. Similarly, a true
output of 1 is observed to be 0 with probability $\eta_1$ and 1 with probability $1 - \eta_1$. When $\eta_0 = \eta_1$,
this noise model is equivalent to that considered by Angluin and Laird, and Sloan. Note that
when $\eta_0 + \eta_1 = 1$, outputs of 0 or 1 are entirely indistinguishable in an information-theoretic
sense. Moreover, we can assume without loss of generality that $\eta_0 + \eta_1 < 1$ by symmetry of the
behavior of the formula $f$ with its negation $\neg f$.

If we regard our algorithm’s use of a fixed distribution as a form of membership query, we
can also handle large rates of misclassification noise in the queries. Here the formulation of a
meaningful noise model is more problematic. In particular, we wish to disallow the uninteresting
technique of repeatedly querying a particular instance in order to obtain its true classification
with overwhelming probability. Thus, we consider a model in which noisy labels are persistent:
for each instance $z$, on the first query to $z$, the true output of the target concept is computed
and is reversed with probability $\eta_0$ or $\eta_1$, according to whether the true output is 0 or 1 (as
described above). However, on all subsequent queries to $z$, the label returned is the same as the
label returned with the first query to $z$. A natural interpretation of such persistent noise is that
of a teacher who is simply wrong on certain instances, and cannot be expected to change his
mind with repeated sampling. This kind of persistent noise is not a problem for our algorithms
because, when $n$ is large, the algorithm is extremely unlikely to query the same instance twice.

Our algorithm assumes that $\eta_0 + \eta_1$ is bounded away from 1 so that $\eta_0 + \eta_1 \le 1 - \rho$ for some
known positive quantity $\rho$. The error rates themselves, $\eta_0$ and $\eta_1$, are assumed to be unknown.
Our algorithm exactly identifies the target formula with high probability in time polynomial in
all of the usual parameters, and $1/\rho$.

Our  robust  algorithm  has  a  similar  structure  to  that  of  the  algorithm  described  previously  for 
the  noise-free  case:  The  algorithm  begins  by  determining  the  relevance  and  sign  of  each  variable. 
However,  it  is  not  clear  at  this  point  how  the  level  of  each  variable  might  be  ascertained  in  the 
presence  of  noise.  Nevertheless,  it  turns  out  to  be  possible  to  find  three  bottom-level  variables 
which  are  inputs  to  the  same  gate.  As  before,  once  such  a  triple  has  been  discovered,  the 
remainder  of  the  formula  can  be  identified  recursively. 

To start with, note that if $p$ is the probability that a 1 is output by the target formula $f$
under some distribution on $\{0,1\}^n$, then the probability that a 1 is observed by the learner is

$$p(1 - \eta_1) + (1 - p)\eta_0 = p(1 - \eta_0 - \eta_1) + \eta_0 .$$

Thus, $\tilde{A}_f(p) = A_f(p) \cdot (1 - \eta_0 - \eta_1) + \eta_0$ is the probability that a 1 is observed when each
input is 1 with probability $p$. Under the uniform distribution, a 1 is observed with probability
$\xi = \tilde{A}_f(\tfrac{1}{2})$. Since $\eta_0$ and $\eta_1$ are unknown, $\xi$ is unknown as well. However, an accurate estimate
$\hat{\xi}$ (say, within $O(\rho/2^h)$ of $\xi$) can be efficiently obtained in the usual manner by sampling.
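The identity above can be checked numerically. The following is a minimal sketch, not part of the original algorithm: the function names, the sample size and the seed are illustrative assumptions, and the true label is simply drawn as a Bernoulli($p$) bit rather than computed from a formula.

```python
import random


def observed_one_probability(p, eta0, eta1):
    """Chance of observing label 1 when the true label is 1 with probability p."""
    return p * (1 - eta1) + (1 - p) * eta0


def noisy_label(true_label, eta0, eta1, rng):
    """Flip a true 0 with probability eta0 and a true 1 with probability eta1."""
    flip_probability = eta1 if true_label == 1 else eta0
    return 1 - true_label if rng.random() < flip_probability else true_label


def estimate_observed_rate(p, eta0, eta1, samples, seed=0):
    """Empirical fraction of observed 1s over `samples` noisy draws."""
    rng = random.Random(seed)
    ones = 0
    for _ in range(samples):
        true_label = 1 if rng.random() < p else 0
        ones += noisy_label(true_label, eta0, eta1, rng)
    return ones / samples
```

With $p = 1/2$, $\eta_0 = 0.1$ and $\eta_1 = 0.2$, both forms of the expression give 0.45, and the sampled estimate should land close to that value.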

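The next lemma shows that a variable $x_i$’s relevance and sign can be determined by hard-wiring
it to 1 and comparing $\hat{\xi}$ to an estimate of the value $\alpha = \tilde{A}_{f|x_i \leftarrow 1}(\tfrac{1}{2})$.

Lemma 5.1 Let $f$ be a read-once majority formula of depth $h$. Let $\hat{\xi}$ and $\hat{\alpha}$ be estimates of
$\xi = \tilde{A}_f(\tfrac{1}{2})$ and $\alpha = \tilde{A}_{f|x_j \leftarrow 1}(\tfrac{1}{2})$, for some variable $x_j$. Assume $|\alpha - \hat{\alpha}| < \tau$ and $|\xi - \hat{\xi}| < \tau$ for
some $\tau < \rho/2^{h+3}$. Then

• $x_j$ is relevant if and only if $|\hat{\alpha} - \hat{\xi}| > 2\tau$;

• if $x_j$ is relevant, then it occurs negated if and only if $\hat{\alpha} < \hat{\xi}$.

Proof: Note that $\alpha - \xi = (1 - \eta_0 - \eta_1)(A_{f|x_j \leftarrow 1}(\tfrac{1}{2}) - \tfrac{1}{2})$. The lemma then follows from Lemma 3.2,
and by noting that $1 - \eta_0 - \eta_1 \ge \rho$. ■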
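The decision rule of Lemma 5.1 can be sketched as follows. This is an illustration under stated assumptions rather than the thesis's procedure: the tuple encoding of formulas and all function names are hypothetical, and exact output probabilities stand in for the sampled estimates $\hat{\xi}$ and $\hat{\alpha}$ (so the estimation error is zero).

```python
def prob_one(formula, p):
    """Exact Pr[formula = 1] when input i is 1 independently with probability p[i]."""
    kind = formula[0]
    if kind == "var":
        return p[formula[1]]
    if kind == "not":
        return 1 - prob_one(formula[1], p)
    a, b, c = (prob_one(sub, p) for sub in formula[1:])  # kind == "maj"
    return a * b + a * c + b * c - 2 * a * b * c


def noisy(q, eta0, eta1):
    """Observed probability of a 1 when the true probability of a 1 is q."""
    return q * (1 - eta0 - eta1) + eta0


def classify_variable(alpha_hat, xi_hat, tau):
    """Relevant iff the estimates differ by more than 2*tau; the sign gives negation."""
    if abs(alpha_hat - xi_hat) <= 2 * tau:
        return "irrelevant"
    return "negated" if alpha_hat < xi_hat else "unnegated"


def classify_all(formula, n, eta0, eta1, tau):
    """Hard-wire each variable to 1 in turn and apply the rule."""
    uniform = {i: 0.5 for i in range(n)}
    xi = noisy(prob_one(formula, uniform), eta0, eta1)
    result = {}
    for j in range(n):
        hardwired = dict(uniform)
        hardwired[j] = 1.0
        alpha = noisy(prob_one(formula, hardwired), eta0, eta1)
        result[j] = classify_variable(alpha, xi, tau)
    return result


# Depth-2 example: MAJ(x0, NOT x1, MAJ(x2, x3, x4)); x5 does not appear.
F = ("maj", ("var", 0), ("not", ("var", 1)),
     ("maj", ("var", 2), ("var", 3), ("var", 4)))
```

For this example with $\eta_0 = 0.1$, $\eta_1 = 0.2$ (so $\rho = 0.7$ is admissible) and $h = 2$, any $\tau$ below $0.7/32$ satisfies the lemma's hypothesis; `classify_all(F, 6, 0.1, 0.2, 0.02)` uses $\tau = 0.02$.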

More difficult is the problem of determining the level of each variable since $\eta_0$ and $\eta_1$
are unknown. Nevertheless, it turns out to be possible to identify the formula without first
determining  the  level  of  each  variable.  In  particular,  we  can  determine  a  triple  of  variables 
which  are  inputs  to  the  same  bottom-level  gate.  As  described  in  Section  3-3,  once  this  is  done, 
the  three  variables  can  be  replaced  by  a  meta-variable,  and  the  rest  of  the  formula  can  be 
constructed  recursively.  Thus,  to  complete  the  algorithm,  we  need  only  describe  a  technique 
for  finding  such  a  triple. 

From the comments above, we can assume without loss of generality that all variables
are relevant and unnegated. The key point, proved below, is the following: $\tilde{A}_{f|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2})$
is minimized over triples $x_i$, $x_j$ and $x_k$ whenever the three variables are inputs to the same
bottom-level gate. Thus, such a triple can be found by estimating $\tilde{A}_{f|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2})$ for each
triple and choosing the one with the smallest estimated value.

Lemma 5.2 Let $f$ be a monotone, read-once majority formula of depth $h$. For all triples of
distinct indices $i$, $j$ and $k$, let $\hat{\alpha}_{ijk}$ be an estimate of $\alpha_{ijk} = \tilde{A}_{f|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2})$, and assume that
$|\alpha_{ijk} - \hat{\alpha}_{ijk}| < \tau \le 3\rho/2^{h+4}$. Suppose that $\hat{\alpha}_{qrs} = \min\{\hat{\alpha}_{ijk} : i, j, k \text{ distinct}\}$. Then $x_q$, $x_r$ and
$x_s$ are bottom-level variables that are inputs to the same gate.

Proof: From Lemma 3.5, if $x_i$, $x_j$ and $x_k$ are bottom-level variables that are inputs to the
same gate, then

$$\alpha_{ijk} = (1 - \eta_0 - \eta_1)\left(\tfrac{1}{2} + (\tfrac{1}{2})^h\right) + \eta_0 .$$

Otherwise, Lemmas 3.5 and 3.6 imply that

$$\alpha_{ijk} \ge (1 - \eta_0 - \eta_1)\left(\tfrac{1}{2} + (\tfrac{1}{2})^h\right) + \eta_0 + 3\rho/2^{h+3} .$$

Since each $\hat{\alpha}_{ijk}$ is accurate to within $3\rho/2^{h+4}$, it follows that $\hat{\alpha}_{qrs}$ can be minimal only if $x_q$, $x_r$
and $x_s$ are bottom-level inputs to the same gate. ■
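The triple search suggested by Lemma 5.2 can be sketched as below. As an assumption for illustration, formulas are encoded as nested 3-tuples with integer leaves, all variables are relevant and unnegated, and exact noisy output probabilities replace the sampled estimates $\hat{\alpha}_{ijk}$; the names are hypothetical.

```python
from itertools import combinations


def prob_one(formula, p):
    """Exact Pr[formula = 1]; leaves are variable indices, gates are 3-tuples."""
    if isinstance(formula, int):
        return p[formula]
    a, b, c = (prob_one(sub, p) for sub in formula)
    return a * b + a * c + b * c - 2 * a * b * c


def find_bottom_triple(formula, n, eta0, eta1):
    """Return a triple whose hard-wiring to 1 minimizes the noisy rate of observed 1s."""
    best_triple, best_value = None, None
    for triple in combinations(range(n), 3):
        p = [0.5] * n
        for i in triple:
            p[i] = 1.0  # hard-wire the three variables to 1
        value = prob_one(formula, p) * (1 - eta0 - eta1) + eta0
        if best_value is None or value < best_value:
            best_triple, best_value = triple, value
    return best_triple


# Complete depth-2 majority formula on nine variables.
FORMULA = ((0, 1, 2), (3, 4, 5), (6, 7, 8))
```

On this formula the three bottom-level gates tie for the minimum, so the search returns one of their input triples; a full learner would then replace that triple by a meta-variable and recurse, as described in the text.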

Thus,  Lemma  5.2  gives  a  technique  for  finding  bottom-level  inputs  to  the  same  gate,  and, 
as  previously  mentioned,  the  remainder  of  the  formula  can  be  constructed  recursively  as  in 
Section  3-3.  We  thus  obtain  the  main  result  of  this  section: 

Theorem 5.3 There exists an algorithm with the following properties: Given $h$, $n$, $\rho > 0$,
$\delta > 0$, and access to examples drawn from the uniform distribution on $\{0,1\}^n$, labeled by a read-once
majority formula $f$ of depth at most $h$ on $n$ variables, and misclassified with probabilities
$\eta_0$ and $\eta_1$ (as described above) for $\eta_0 + \eta_1 \le 1 - \rho$, the algorithm exactly identifies $f$ with
probability at least $1 - \delta$. The algorithm’s sample complexity is $O((4^h/\rho^2) \cdot \log(n/\delta))$, and its
time complexity is $O((4^h/\rho^2) \cdot (n + r^3) \cdot \log(n/\delta))$, where $r$ is the number of relevant variables
appearing in the target formula.

Finally, we comment that our algorithms can be extended to handle a fair amount of malicious
noise. In this model, an adversary is allowed to corrupt each example in any manner
he chooses (both the labels and the variable settings) with probability $\eta$. We can show that
the algorithm described in Section 3-3 for majority formulas can handle malicious error rates
as large as $O(2^{-h})$ where $h$ is the height of the target formula. Thus, for logarithmic-depth
formulas, we can handle malicious error rates up to an inverse polynomial in the number of
relevant variables. Similar results also hold for NAND formulas.

The extension of the algorithm to handle malicious noise is quite simple. The algorithm
of Section 3-3 depends only on accurate estimates of probabilities $\alpha$ that the formula outputs
1 under various distributions. Note that malicious noise can change such probabilities by at
most an additive term of $\eta$. That is, if the formula, under some distribution, outputs 1 with
probability $\alpha$, then the chance that a positive example is observed (in the presence of malicious
noise) is at least $\alpha - \eta$ and at most $\alpha + \eta$.

Thus, using Chernoff bounds (Lemma 2-3.6), an estimate of $\alpha$ that is accurate to within
$\eta + \tau$ can be obtained from a sample of size polynomial in $1/\tau$ (with high probability). Since
Theorems 3.3 and 3.7 show that the required estimates need only be accurate to within $3/2^{h+4}$,
it follows that a malicious error rate of, say, half this amount can be tolerated without increasing
the algorithm’s complexity by more than constant factors.

3-6  Learning  unbounded-depth  formulas 

In  this  final  section  on  amplification-function  techniques,  we  describe  extensions  of  our  algo¬ 
rithms  to  learn  formulas  of  unbounded  depth  in  Valiant’s  PAC  model  with  respect  to  specific 
distributions.  As  in  the  last  section,  we  focus  only  on  majority  formulas,  omitting  the  similar 
application  of  these  techniques  to  nand  formulas. 

For formulas of unbounded depth, exact identification from the uniform distribution in
polynomial time is too much to ask: For purely information-theoretic reasons, at least $\Omega(2^h)$
examples must be drawn from the uniform distribution to exactly identify a majority formula
of depth $h$. This can be proved by showing (say, by induction on $h$) that if $x_i$ occurs at level $h$
of formula $f$, then $2^{-h}$ is the probability that an instance is chosen for which the output of $f$
depends on $x_i$ (i.e., for which $f$’s output changes if $x_i$ is flipped). Thus, $\Omega(2^h)$ random examples
are needed simply to determine, for example, whether $x_i$ occurs negated or unnegated.
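The claim that a level-$h$ variable influences the output with probability $2^{-h}$ can be confirmed by brute force on a small formula. The sketch below assumes a nested-tuple encoding (integer leaves, 3-tuples for majority gates) and counts levels as the number of gates above a variable; both the encoding and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def evaluate(formula, bits):
    """Evaluate a read-once majority formula on a 0/1 assignment."""
    if isinstance(formula, int):
        return bits[formula]
    return 1 if sum(evaluate(sub, bits) for sub in formula) >= 2 else 0


def influence(formula, n, i):
    """Exact probability, under the uniform distribution, that flipping x_i changes f."""
    count = 0
    for bits in product((0, 1), repeat=n):
        flipped = list(bits)
        flipped[i] ^= 1
        if evaluate(formula, bits) != evaluate(formula, flipped):
            count += 1
    return Fraction(count, 2 ** n)


# x0 and x1 sit at level 1; x2, x3 and x4 sit at level 2.
FORMULA = (0, 1, (2, 3, 4))
```

Here `influence(FORMULA, 5, 0)` is $1/2$ and `influence(FORMULA, 5, 2)` is $1/4$, matching $2^{-1}$ and $2^{-2}$.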

Therefore, to handle arbitrarily deep formulas, we must relax our requirement of exact identification.
Instead, we adopt Valiant’s criterion of obtaining a good approximation of the target
concept (with high probability). As before, our algorithms do not work for all distributions,
just the fixed-point distribution. We describe an algorithm that, given $\epsilon, \delta > 0$ and access to
random examples of the target majority formula drawn from the uniform distribution, outputs
with probability $1 - \delta$ an $\epsilon$-good hypothesis, that is, one that agrees with the target formula
on a randomly chosen instance from the uniform distribution with probability at least $1 - \epsilon$.
Furthermore, the running time is polynomial in $1/\delta$, $1/\epsilon$ and the number of variables $n$.

We begin by briefly discussing the main ideas of the algorithm. First, as noted above,
variables that occur deep in the formula are unimportant in the sense that their values are
unlikely to influence the formula’s output on a randomly chosen instance. Intuitively, we would
like to take advantage of this fact by somehow treating such variables as irrelevant. However,
they cannot be simply deleted from the formula without leaving “holes” that must in some way
be handled.

We therefore introduce the notion of a partially visible function. This is a function on a set
of  visible  variables  whose  values  can  be  observed  by  the  learner,  and  a  set  of  hidden  variables 
which  are  not  observable.  With  respect  to  a  distribution  on  the  set  of  assignments  to  the 
hidden  variables,  we  say  that  two  partially  visible  Boolean  functions  are  equivalent  if,  for  all 
assignments  to  the  visible  variables,  the  probabilities  are  the  same  that  each  function  evaluates 
to  1  (where  the  probabilities  are  taken  over  random  assignments  to  the  hidden  variables).  In 
other  words,  the  behaviors  of  the  two  functions  are  indistinguishable  with  respect  to  the  visible 
variables. 

Thus, we handle all deep variables by regarding them as hidden variables, and the target
formula as one that is partially visible. In particular, insignificant variables (those that occur
at or below level $h = \lceil \lg(n/2\epsilon) \rceil$) are considered hidden and their actual values ignored. We call
the partially visible formula obtained from the target formula $f$ in this manner the truncated
target.

Our algorithm works by exactly identifying the truncated target, that is, by constructing a
partially visible formula $f'$ that is equivalent to it (in the sense described above, with respect to
the uniform distribution). It can be shown that $f$ and $f'$ agree on a randomly chosen instance
with probability at least $1 - \epsilon$, and therefore $f'$ is an $\epsilon$-good hypothesis satisfying the PAC
criterion.

It remains then only to show how $f'$ can be constructed. First, observe that by Lemma 3.2
all significant variables occurring in $f$ can be detected (and their signs and levels determined)
in polynomial time. Moreover, by arguments similar to those given in Section 3-3, it can be
shown that if some triple of significant variables meet at a gate, then the level of that gate can
be detected from the amplification function by hard-wiring the three variables to 1. We call this
information (the level and sign of each significant variable, and the level at which each triple of
significant variables meet, if at all) the formula’s schedule. It turns out that the schedule alone
is sufficient to fully reconstruct the partially visible formula $f'$, as is shown below.

These  then  are  the  main  ideas  of  the  algorithm.  What  follows  is  a  more  detailed  exposition. 

A partially visible function $f(x : y)$ is a Boolean function $f$ on a set of visible variables
$x = x_1 \cdots x_r$, and a set of hidden variables $y = y_1 \cdots y_s$. Two partially visible functions $f(x : y)$
and $g(x : z)$ on the same set of visible variables are equivalent with respect to distributions $D$
and $E$ on the domains of $y$ and $z$ if, for all $x$, $\Pr[f(x : Y) = 1] = \Pr[g(x : Z) = 1]$, where $Y$
and $Z$ are random variables representing a random assignment to $y$ and $z$ according to $D$ and
$E$. In the discussion that follows, we will only be interested in uniform distributions.

As  described  above,  our  algorithm  regards  variables  that  occur  deep  in  the  target  formula 
as  hidden  variables.  The  next  two  lemmas  show  that  two  partially  visible  read-once  majority 
formulas that are identical except for some deep hidden variables are very likely to produce the
same output on randomly chosen inputs.

Lemma 6.1 Let $f$ be a read-once majority formula on $n$ variables. Let $t$ be the level of $x_n$
in $f$. Let $X_1, \ldots, X_{n-1}$, $Y$ and $Z$ be independent Bernoulli variables, each 1 with probability
1/2. Then $\Pr[f(X_1, \ldots, X_{n-1}; Y) \ne f(X_1, \ldots, X_{n-1}; Z)] = 2^{-t-1}$.

Proof: By induction on $t$. If $t = 0$, then $f$ is the function $x_n$ and since $\Pr[Y \ne Z] =
1/2$, the lemma holds. If $t > 0$, then let $f = \mathrm{MAJ}(f_1, f_2, f_3)$, and suppose that $f_1$ is the
subformula in which $x_n$ occurs. Since $x_n$ does not also occur in $f_2$ or $f_3$, we will regard
these as functions only on the remaining $n - 1$ variables. It is not hard to see then that
$f(X_1, \ldots, X_{n-1}; Y) \ne f(X_1, \ldots, X_{n-1}; Z)$ if and only if $f_2(X_1, \ldots, X_{n-1}) \ne f_3(X_1, \ldots, X_{n-1})$
and $f_1(X_1, \ldots, X_{n-1}; Y) \ne f_1(X_1, \ldots, X_{n-1}; Z)$. Since $f_2$ and $f_3$ each output 1 independently
with probability 1/2, we have

$$\Pr[f_2(X_1, \ldots, X_{n-1}) \ne f_3(X_1, \ldots, X_{n-1})] = 1/2 .$$

Also, by inductive hypothesis,

$$\Pr[f_1(X_1, \ldots, X_{n-1}; Y) \ne f_1(X_1, \ldots, X_{n-1}; Z)] = 2^{-t} .$$

The lemma then follows by independence. ■

Lemma 6.2 Let $f$ be a read-once majority formula on $n$ variables. Let $t_i$ be the level of
variable $x_i$ in $f$. Let $X_1, \ldots, X_n, X'_1, \ldots, X'_r$, $r \le n$, be independent Bernoulli variables, each 1
with probability 1/2. Then $\Pr[f(X_1, \ldots, X_n) \ne f(X'_1, \ldots, X'_r, X_{r+1}, \ldots, X_n)] \le \sum_{i=1}^{r} 2^{-t_i-1}$.

Proof: By induction on $r$. If $r = 0$, then the lemma holds trivially. For $r > 0$, we have

$$\begin{aligned}
\Pr[f(X_1, \ldots, X_n) &\ne f(X'_1, \ldots, X'_r, X_{r+1}, \ldots, X_n)] \\
&\le \Pr[f(X_1, \ldots, X_n) \ne f(X'_1, \ldots, X'_{r-1}, X_r, \ldots, X_n)] \\
&\quad + \Pr[f(X'_1, \ldots, X'_{r-1}, X_r, \ldots, X_n) \ne f(X'_1, \ldots, X'_r, X_{r+1}, \ldots, X_n)] \\
&\le \sum_{i=1}^{r-1} 2^{-t_i-1} + 2^{-t_r-1}
\end{aligned}$$

where the last inequality follows from our inductive hypothesis and the preceding lemma. ■
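Lemmas 6.1 and 6.2 can be checked exactly on a small formula by enumerating every assignment together with every redraw of the chosen coordinates. The sketch below is illustrative only: the nested-tuple encoding and the function names are assumptions, and levels count the number of gates above a variable.

```python
from fractions import Fraction
from itertools import product


def evaluate(formula, bits):
    """Evaluate a read-once majority formula (integer leaves, 3-tuple gates)."""
    if isinstance(formula, int):
        return bits[formula]
    return 1 if sum(evaluate(sub, bits) for sub in formula) >= 2 else 0


def levels(formula, depth=0):
    """Map each variable index to its level (number of gates above it)."""
    if isinstance(formula, int):
        return {formula: depth}
    result = {}
    for sub in formula:
        result.update(levels(sub, depth + 1))
    return result


def resample_disagreement(formula, n, resampled):
    """Exact Pr[f(X) != f(X with the listed coordinates independently redrawn)]."""
    differ = 0
    for bits in product((0, 1), repeat=n):
        for fresh in product((0, 1), repeat=len(resampled)):
            changed = list(bits)
            for index, value in zip(resampled, fresh):
                changed[index] = value
            if evaluate(formula, bits) != evaluate(formula, changed):
                differ += 1
    return Fraction(differ, 2 ** (n + len(resampled)))


def lemma_bound(formula, resampled):
    """Right-hand side of Lemma 6.2: the sum of 2^(-t_i - 1) over redrawn variables."""
    t = levels(formula)
    return sum(Fraction(1, 2 ** (t[i] + 1)) for i in resampled)


FORMULA = (0, 1, (2, 3, 4))  # x2, x3, x4 are at level 2
```

Redrawing the single level-2 variable $x_2$ gives disagreement exactly $2^{-3}$, as Lemma 6.1 states; redrawing $x_2$ and $x_3$ together gives $3/16$, which is below the Lemma 6.2 bound of $1/4$.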

As proved below, Lemma 6.2 implies that any partially visible formula is an $\epsilon$-good hypothesis
if it is equivalent to the truncated target, the partially visible formula obtained from the
target formula by regarding all variables at or below level $h = \lceil \lg(n/2\epsilon) \rceil$ as hidden variables.
Given an assignment to the visible variables, such a hypothesis is evaluated in the obvious
manner by choosing a random assignment to the hidden variables and computing the output
of the formula on the combined assignments to the hidden and visible variables. (Thus, the
hypothesis is likely to be randomized.)

Lemma 6.3 Let $\epsilon > 0$, and let $f$ be a read-once majority formula on $n$ variables. Let $x =
x_1 \cdots x_r$ be the variables occurring above level $h = \lceil \lg(n/2\epsilon) \rceil$, and let $y = y_1 \cdots y_{n-r}$ be the
remaining variables. Let $g(x : z)$ be any partially visible formula equivalent to the partially
visible formula $f(x : y)$. Then $\Pr[f(X : Y) \ne g(X : Z)] \le \epsilon$, where $X$, $Y$ and $Z$ are random
variables representing the uniformly random choice of assignments to $x$, $y$ and $z$. That is,
$g(x : z)$ is an $\epsilon$-good hypothesis for $f$.

Proof: Let $Y'$ be a random variable representing a random assignment to $y$, chosen independently
of $Y$. Since $f(x : y)$ is equivalent to $g(x : z)$, we have

$$\Pr[f(X : Y) \ne g(X : Z)] = \Pr[f(X : Y) \ne f(X : Y')] .$$

By Lemma 6.2, the right hand side of this equation is bounded by $\epsilon$, since each of the $n - r \le n$
variables $y_i$ occurs at or below level $h$ in $f$, so that the bound of Lemma 6.2 is at most $n \cdot 2^{-h-1} \le \epsilon$. ■
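The choice of truncation level can be made concrete with a short calculation. This sketch assumes only the formula $h = \lceil \lg(n/2\epsilon) \rceil$ from the text; the function names are hypothetical.

```python
import math


def truncation_level(n, eps):
    """Level h = ceil(lg(n / (2 * eps))) at or below which variables are hidden."""
    return math.ceil(math.log2(n / (2 * eps)))


def hidden_disagreement_bound(n, h):
    """Lemma 6.2 bound when at most n hidden variables each sit at level >= h."""
    return n * 2.0 ** (-h - 1)
```

For example, with $n = 100$ and $\epsilon = 0.1$ the level is $h = 9$, and the resulting bound $100 \cdot 2^{-10} \approx 0.098$ is indeed at most $\epsilon$.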

For $\epsilon > 0$ and target formula $f$, we will henceforth say that variables occurring above level
$h = \lceil \lg(n/2\epsilon) \rceil$ are significant. Note that Theorem 3.3 implies that the significance, sign and
level of any variable $x_j$ can be determined by hard-wiring that variable to 1, as usual. More
specifically, if $\hat{\alpha}$ is an estimate of $\alpha = A_{f|x_j \leftarrow 1}(\tfrac{1}{2})$ for which $|\alpha - \hat{\alpha}| < \tau \le (\tfrac{1}{2})^{\cdots}$, then $x_j$ is
significant if and only if $|\hat{\alpha} - \tfrac{1}{2}| > (\tfrac{1}{2})^{h+1} + \tau$, and, if it is significant, then its sign and level can
be determined as in Theorem 3.3.

Similar to Theorem 3.7, we can show that, for any triple of significant variables, we can
determine the level of the gate at which the triple meets, if at all. More precisely, if $x_i$, $x_j$
and $x_k$ are three unnegated variables occurring at levels $t_1$, $t_2$ and $t_3$, and if $\hat{\alpha}$ is an estimate
of $\alpha = A_{f|x_i,x_j,x_k \leftarrow 1}(\tfrac{1}{2})$ for which $|\hat{\alpha} - \alpha| < \tau \le 2^{-3h}$, then it follows from Lemmas 3.5 and 3.6
that $x_i$, $x_j$ and $x_k$ meet at a level-$d$ gate if and only if

$$\left| \tfrac{1}{2} + (\tfrac{1}{2})^{t_1+1} + (\tfrac{1}{2})^{t_2+1} + (\tfrac{1}{2})^{t_3+1} - (\tfrac{1}{2})^{t_1+t_2+t_3-2d-1} - \hat{\alpha} \right| < \tau .$$
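The closed-form value used in this test can be compared against an exact computation on a small formula. The sketch below is an illustration under assumptions: formulas are nested 3-tuples with integer leaves, all variables are unnegated, exact probabilities replace the estimate $\hat{\alpha}$, and the function names are hypothetical.

```python
def prob_one(formula, p):
    """Exact Pr[formula = 1]; leaves are variable indices, gates are 3-tuples."""
    if isinstance(formula, int):
        return p[formula]
    a, b, c = (prob_one(sub, p) for sub in formula)
    return a * b + a * c + b * c - 2 * a * b * c


def predicted_meeting_value(t1, t2, t3, d):
    """Output probability when three variables at levels t1, t2, t3 that meet
    at a level-d gate are all hard-wired to 1."""
    return (0.5 + 0.5 ** (t1 + 1) + 0.5 ** (t2 + 1) + 0.5 ** (t3 + 1)
            - 0.5 ** (t1 + t2 + t3 - 2 * d - 1))


def hardwired_value(formula, n, triple):
    """Exact output probability with the given triple hard-wired to 1."""
    p = [0.5] * n
    for i in triple:
        p[i] = 1.0
    return prob_one(formula, p)


# x0 is at level 2, x3 at level 1, x5 at level 3; these three meet at the root (d = 0).
FORMULA = ((0, 1, 2), 3, (4, (5, 6, 7), 8))
```

For the triple $(x_0, x_3, x_5)$ both computations give $0.90625$, while plugging in the wrong gate level $d = 1$ predicts $0.8125$, so the test separates the two cases whenever $\tau$ is small enough.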


As mentioned above, we call the sum total of this information (the significance of each
variable, the level and sign of each significant variable, and the level of the gate at which each
triple of significant variables meet, if at all) the formula’s schedule. It remains then only to
show how an $\epsilon$-good hypothesis can be constructed from the schedule. Specifically, we show how
to construct a partially visible formula that is equivalent to the truncated target $f(x : y)$. (Here,
$x$ is the vector of visible (i.e., significant) variables, and $y$ is the vector of hidden (insignificant)
variables.)

Suppose first that no three visible variables meet in $f$. Such a formula is said to be unstructured.
Although strictly speaking $f$ cannot be unstructured (by our choice of $h$), this special
case turns out nevertheless to be important in handling the more general case since subformulas
of $f$ may be unstructured.

Lemma 6.4 below shows that an unstructured formula $f(x : y)$ is equivalent to any other
unstructured  partially  visible  formula  whenever  each  visible  variable  occurs  at  the  same  level 
with  the  same  sign  in  both  formulas.  Thus,  unstructured  formulas  are  not  changed  when 
visible  variables  are  moved  around  within  the  same  level.  This  fact  makes  the  identification  of 
unstructured  formulas  from  their  schedules  quite  easy. 

For any partially visible read-once majority formula $f(x : y)$, let $p_f(x) = \Pr[f(x : Y) = 1]$
where $Y$ represents a random assignment to $y$.

Lemma 6.4 Let $f(x : y)$ and $g(x : z)$ be unstructured read-once majority formulas on $s$ visible
variables. Suppose that each visible variable $x_j$ is relevant and occurs at the same level $t_j$ with
the same sign in both formulas. Then the two partially visible formulas are equivalent.

Proof: It suffices to prove the lemma when no visible variable is negated since negated variables
can simply be replaced by unnegated meta-variables.

To prove the lemma, we show that

$$p_f(x) = \frac{1}{2} + \sum_{i=1}^{s} 2^{-t_i}\left(x_i - \frac{1}{2}\right). \qquad (6.1)$$

Since this statement applies to any unstructured formula, it follows immediately that $p_f(x) =
p_g(x)$ and the two partially visible formulas are equivalent.

We prove Equation (6.1) by induction on the height $h$ of $f$. If $h = 0$, then $f$ consists of
a single visible or hidden variable. If $f$ is the formula $x_j$, where $x_j$ is some visible variable,
then $p_f(x) = x_j$, satisfying (6.1). If $f$ is the formula $y_j$, where $y_j$ is a hidden variable, then
$p_f(x) = \frac{1}{2}$, also satisfying (6.1).

If $h > 0$, then let $f = \mathrm{MAJ}(f_1, f_2, f_3)$ where $f_1$, $f_2$ and $f_3$ are partially visible subformulas.
Since $f$ is unstructured, one of these (say $f_3$) contains no visible variables, and thus $p_{f_3}(x) = \frac{1}{2}$.
Suppose without loss of generality that $x_1, \ldots, x_r$ are the visible variables relevant to $f_1$. Then,
by inductive hypothesis,

$$p_{f_1}(x) = \frac{1}{2} + \sum_{i=1}^{r} 2^{-t_i+1}\left(x_i - \frac{1}{2}\right)$$

and

$$p_{f_2}(x) = \frac{1}{2} + \sum_{i=r+1}^{s} 2^{-t_i+1}\left(x_i - \frac{1}{2}\right).$$

Applying Lemma 3.1, it is easily verified that Equation (6.1) is satisfied, completing the induction. ■
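Equation (6.1) can be verified exactly on a small unstructured formula. In the sketch below the tagged-tuple encoding (`"x"` for visible leaves, `"y"` for hidden leaves, `"maj"` for gates) and the function names are assumptions made for illustration; hidden leaves contribute probability $1/2$, which is valid because the formula is read-once.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)


def visible_prob(formula, x):
    """p_f(x): probability over uniform hidden variables that the formula outputs 1."""
    kind = formula[0]
    if kind == "x":
        return Fraction(x[formula[1]])
    if kind == "y":
        return HALF
    a, b, c = (visible_prob(sub, x) for sub in formula[1:])  # kind == "maj"
    return a * b + a * c + b * c - 2 * a * b * c


def closed_form(levels, x):
    """Right-hand side of Equation (6.1) for visible variable levels t_i."""
    return HALF + sum(Fraction(1, 2 ** t) * (x[i] - HALF) for i, t in levels.items())


# MAJ(MAJ(x0, y0, y1), x1, y2): x0 is at level 2 and x1 at level 1.
F = ("maj", ("maj", ("x", 0), ("y", 0), ("y", 1)), ("x", 1), ("y", 2))
LEVELS = {0: 2, 1: 1}
```

For instance, at $x = (0, 0)$ both sides equal $1/8$, and the two agree on all four assignments to the visible variables.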

Thus, if $f(x : y)$ is unstructured, then an equivalent unstructured formula can be constructed
from $f$’s schedule. For instance, here is an efficient algorithm: Let $t$ be the depth of the deepest
visible variable in $f$. Break the set of all level-$t$ variables into pairs. Replace each such pair
$x_i, x_j$ by a level-$(t - 1)$ meta-variable $w = \mathrm{MAJ}(x_i, x_j, y)$, where $y$ is a new hidden variable. If an
odd level-$t$ variable $x_i$ remains, replace it with a level-$(t - 1)$ meta-variable $w = \mathrm{MAJ}(x_i, y, y')$,
where $y$ and $y'$ are new hidden variables. Repeat for levels $t - 1, t - 2, \ldots, 1$. It is not hard to
show that this algorithm results in a formula that is unstructured, and that is consistent with
$f$’s schedule (and so is equivalent).
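The pairing construction just described can be sketched as follows. This is a minimal illustration, not the thesis's code: it assumes all visible variables are unnegated, takes the schedule as a dictionary from variable name to level, and uses a hypothetical tagged-tuple encoding with freshly numbered hidden leaves.

```python
from fractions import Fraction
from itertools import count, product

HALF = Fraction(1, 2)


def build_unstructured(levels):
    """Build an unstructured formula whose visible variables sit at the given levels."""
    fresh = count()

    def hidden():
        return ("y", next(fresh))  # a new hidden variable

    by_level = {}
    for name, t in levels.items():
        by_level.setdefault(t, []).append(("x", name))
    for t in range(max(levels.values()), 0, -1):
        nodes = by_level.pop(t, [])
        parents = by_level.setdefault(t - 1, [])
        while len(nodes) >= 2:  # pair up level-t nodes under a gate with one hidden input
            a, b = nodes.pop(), nodes.pop()
            parents.append(("maj", a, b, hidden()))
        if nodes:  # odd node left over: pad with two hidden inputs
            parents.append(("maj", nodes.pop(), hidden(), hidden()))
    roots = by_level[0]
    if len(roots) != 1:
        raise ValueError("levels are not those of an unstructured formula")
    return roots[0]


def visible_prob(formula, x):
    """Probability over uniform hidden variables that the formula outputs 1."""
    kind = formula[0]
    if kind == "x":
        return Fraction(x[formula[1]])
    if kind == "y":
        return HALF
    a, b, c = (visible_prob(sub, x) for sub in formula[1:])
    return a * b + a * c + b * c - 2 * a * b * c
```

Since every gate built this way has at least one hidden input, no three visible variables meet, and by Lemma 6.4 the result should satisfy Equation (6.1) for the supplied levels; for example, `build_unstructured({"a": 2, "b": 2, "c": 1})` yields a formula with $p_f(x) = a/4 + b/4 + c/2$.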

With  these  tools  in  hand  for  dealing  with  unstructured  formulas,  we  are  now  ready  to 
describe  an  algorithm  for  handling  the  general  case,  i.e.,  for  reconstructing  any  (not  necessarily 
unstructured)  formula  from  its  schedule. 

Let $f(x : y)$ be the truncated target. If $f$ is unstructured, then the previous algorithm
applies. Otherwise, we can find from the schedule three visible variables $x_i$, $x_j$ and $x_k$ which
meet at some maximum-depth gate $A$ of $f$; that is, they meet at a level-$d$ gate, and no triple
of visible variables meet at any gate of depth exceeding $d$. Then the subformula $g$ subsumed
by $A$ computes the majority of three subformulas $g_1$, $g_2$ and $g_3$, each containing one of $x_i$, $x_j$
and $x_k$ (say, in that order). Let $x_\ell$ be some other visible variable. Then it is easily verified that
$x_\ell$ is relevant to $g_1$ if and only if $x_\ell$, $x_j$ and $x_k$ meet at a level-$d$ gate (namely, $A$). Thus, all
of the visible variables relevant to $g_1$ (and likewise for $g_2$ and $g_3$) can be determined from the
schedule. Moreover, note that each of these subformulas is unstructured since $A$ is of maximum
depth. Thus, each subformula can be identified using the previous algorithm for unstructured
formulas, and therefore, the entire subformula subsumed by (and including) $A$ can be identified.

The  rest  of  the  formula  can  be  identified  recursively:  we  replace  subformula  g  by  a  new 
meta-variable  w,  and  update  the  schedule  appropriately. 

This completes the algorithm. The sample complexity can be derived, as usual, using
Chernoff bounds (Lemma 2-3.6), and the time analysis is straightforward. We thus have:

Theorem 6.5 There exists an algorithm with the following properties: Given $n$, $\epsilon > 0$, $\delta > 0$, and
access to examples drawn from the uniform distribution on $\{0,1\}^n$ and labeled by any read-once
majority formula $f$ on $n$ variables, the algorithm outputs an $\epsilon$-good hypothesis for $f$ with probability at least
$1 - \delta$. The algorithm’s sample complexity is $O((n/\epsilon)^6 \cdot \log(n/\delta))$, and its time complexity is
$O((n^9/\epsilon^6) \cdot \log(n/\delta))$.

3-7  Learning probabilistic read-once formulas

In this section, we extend the techniques of the preceding sections to a broad class of probabilistic
concepts, which includes the class of all read-once formulas over the usual basis {AND, OR, NOT}.
We show this class is PAC learnable against all product distributions (i.e., all distributions in
which the assignment to each variable is independent of the settings of the other variables).

As described in detail in Chapter 4, a probabilistic concept (p-concept) is a function $c : X \to
[0, 1]$ where $X$ is the domain. (In this chapter, $X$ is always $\{0,1\}^n$.) The interpretation here is
that $c(x)$ is the probability that an instance $x \in X$ is labeled 1, and $1 - c(x)$ is the probability
it is labeled 0. Thus, we assume an oracle EX which first chooses $x \in X$ randomly according
to some target distribution $D$, and then randomly labels $x$ according to $c$ as just described.

In this section, we will be interested in the problem of learning with a model of probability.
Here, the goal is to infer a good approximation of the function $c$ itself. Specifically, given
positive $\epsilon$ and $\delta$, we ask that, with probability at least $1 - \delta$, the learning algorithm find an
$\epsilon$-good model of probability for $c$, i.e., a real-valued hypothesis $h$ such that

$$E_{x \in D}\left[|h(x) - c(x)|\right] \le \epsilon. \qquad (7.1)$$

Furthermore, the learning algorithm’s running time must be polynomial in $1/\epsilon$, $1/\delta$ and $n$.
(This definition differs slightly from, but is equivalent to, the definition of learning with a model of
probability given in Chapter 4. See Section 4-2.)

We describe in this section an algorithm for learning with a model of probability any p-concept
in a particular class of p-concepts against any product distribution on the domain
$\{0,1\}^n$, i.e., any distribution in which the setting of each bit $x_i$ is chosen independently of the
settings of the other bits.

The p-concept class of interest is the class of real-valued read-once formulas over the basis
$\{\mathrm{MUL}, \mathrm{LIN}_{z,w}\}$ where MUL denotes ordinary multiplication of two real numbers, and $\mathrm{LIN}_{z,w}$ is the
unary operator

$$\mathrm{LIN}_{z,w}(y) = z + wy .$$

Here, $z$ and $w$ may be any real numbers for which $z$ and $z + w$ are both in the range $[0, 1]$. We
call formulas over this basis real formulas.

For  instance, 

$$\mathrm{LIN}_{0,.25}(\mathrm{MUL}(\mathrm{LIN}_{.5,.5}(x_1), \mathrm{LIN}_{1,-1}(x_2))) = .25 \cdot (.5 + .5x_1) \cdot (1 - x_2) \qquad (7.2)$$

is  a  read-once  real  formula. 
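The example formula (7.2) can be evaluated directly. The sketch below is illustrative: `lin`, `mul` and `example_7_2` are hypothetical names, and the range check in `lin` simply enforces the stated condition that $z$ and $z + w$ lie in $[0, 1]$.

```python
def lin(z, w):
    """Return the unary gate y -> z + w*y, requiring z and z + w to lie in [0, 1]."""
    if not (0 <= z <= 1 and 0 <= z + w <= 1):
        raise ValueError("z and z + w must both lie in [0, 1]")
    return lambda y: z + w * y


def mul(a, b):
    """Ordinary multiplication of two reals; on 0/1 inputs this is AND."""
    return a * b


def example_7_2(x1, x2):
    """The read-once real formula of Equation (7.2)."""
    return lin(0, 0.25)(mul(lin(0.5, 0.5)(x1), lin(1, -1)(x2)))
```

On Boolean inputs, `example_7_2(1, 0)` is 0.25, `example_7_2(0, 0)` is 0.125, and any input with $x_2 = 1$ gives 0, in agreement with the right-hand side of (7.2).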

An easy induction argument shows that real formulas have range $[0, 1]$, and so are p-concepts.
Also, note that for Boolean-valued inputs, MUL is equivalent to AND, and $\mathrm{LIN}_{1,-1}$ is equivalent
to NOT. Thus, the class of read-once real formulas includes the class of read-once Boolean
formulas over the basis {AND, NOT}, or, equivalently, {AND, OR, NOT}.

Further, our criterion (7.1) for good learning of real formulas implies the usual PAC learnability
of Boolean formulas: if $c$ is a deterministic concept (i.e., a p-concept with range in
$\{0,1\}$), and $h$ is a real-valued hypothesis satisfying (7.1), then $h' = \mathrm{round}(h) = \lfloor h + 1/2 \rfloor$ is a
$2\epsilon$-good hypothesis for $c$ since

$$\Pr_{x \in D}\left[h'(x) \ne c(x)\right] \le \Pr_{x \in D}\left[|h(x) - c(x)| \ge 1/2\right]$$

and, by Markov’s inequality,

$$\epsilon \ge E_{x \in D}\left[|h(x) - c(x)|\right] \ge \tfrac{1}{2} \Pr_{x \in D}\left[|h(x) - c(x)| \ge 1/2\right] .$$

Thus,  the  results  in  this  section  subsume  those  in  Section  3-4. 
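The rounding argument can be illustrated on a toy distribution. The following sketch assumes a finite domain given as a dictionary from instances to probabilities; the hypothesis values and all function names are made up for the example.

```python
import math


def round_hypothesis(h):
    """h'(x) = floor(h(x) + 1/2)."""
    return lambda x: math.floor(h(x) + 0.5)


def expected_abs_error(h, c, dist):
    """E_{x in D} |h(x) - c(x)| for a finite distribution {x: probability}."""
    return sum(weight * abs(h(x) - c(x)) for x, weight in dist.items())


def disagreement(h_rounded, c, dist):
    """Pr_{x in D}[h'(x) != c(x)]."""
    return sum(weight for x, weight in dist.items() if h_rounded(x) != c(x))


# Uniform distribution on {0,1}^2, target c = AND, and an illustrative real-valued h.
DIST = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
H_VALUES = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.6, (1, 1): 0.9}


def c(x):
    return x[0] & x[1]


def h(x):
    return H_VALUES[x]
```

Here the expected absolute error of `h` is about 0.25, and the rounded hypothesis errs only on the instance $(1, 0)$, so its disagreement of 0.25 is within the guaranteed factor of two.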

Note that the class of read-once real formulas also includes the class of Boolean formulas
which have been corrupted with the kind of random misclassification noise described in Section 3-5.
This is because such noise can be simulated by a single, output-level gate $\mathrm{LIN}_{\eta_0,\,1-\eta_0-\eta_1}$:
on input 0, this gate outputs $\eta_0$, and on input 1, it outputs $1 - \eta_1$. Thus, $\eta_0$ and $\eta_1$ are the
respective probabilities that an input of 0 or 1 is “misclassified” or flipped by this gate.

In general, we can regard gates $\mathrm{LIN}_{z,w}$ as describing the behavior of a "noisy" or randomized Boolean gate which on input 0 outputs 1 with probability $z$, and on input 1 outputs 1 with probability $z + w$. Clearly, if the input to such a randomized gate is 1 with probability $p$, then the gate outputs 1 with probability

$$(1 - p)z + p(z + w) = z + wp = \mathrm{LIN}_{z,w}(p).$$

Thus, the probabilistic behavior of a formula constructed with such randomized gates is described by the p-concept obtained by replacing each randomized gate by a $\mathrm{LIN}_{z,w}$ gate, for an appropriate choice of $z$ and $w$. (This can be proved rigorously, for instance, using a straightforward induction argument on the depth of the formula.)

Thus,  our  result  can  be  viewed  as  a  demonstration  of  the  learnability  of  read-once  Boolean 
formulas  with  large  doses  of  noise  sprinkled  throughout  the  formulas.  Such  noise  may  affect  the 
formula’s  output  (“misclassification  noise”),  the  inputs  (“attribute  noise”)  or  it  might  affect 
the  output  of  every  gate  of  the  formula. 




3-7.1  Overview  of  the  learning  algorithm 

Our learning procedure uses many of the ideas and techniques developed in the preceding sections. The algorithm operates in three stages. In Stage I, we estimate the probability $p_i$ that each variable $x_i$ is set to 1. Any "sticky" variables for which this probability is too close to 0 or 1 will be disregarded in later stages. We also determine the "influence" of each variable $x_i$: roughly speaking, this is the probability that the target formula's value changes significantly if $x_i$'s setting is flipped. Variables which have very little influence are also considered irrelevant in later stages, similar to the manner in which insignificant (deeply occurring) variables are ignored in the algorithm of Section 3-6. Not surprisingly, sticky and uninfluential variables can be ignored without introducing much error.

In Stage II, we construct an approximation of the target formula's topology or skeleton structure. By the skeleton of a real formula, we refer to the topological structure of the formula, i.e., the formula stripped of the $z,w$-values on the LIN gates. For instance, the formula in (7.2) has skeleton:

LIN(MUL(LIN($x_1$), LIN($x_2$))).

In Stage II, we infer a skeleton $\sigma$ which we show approximates the target formula in the sense that $\sigma$ is the skeleton of some formula which is a good approximation (say, in the sense of (7.1)) of the target formula.

Finally, in Stage III, we approximate the $z,w$ values of the LIN gates of the skeleton inferred in Stage II.


3-7.2  Some  preliminary  facts 

Let $f$ be the target formula, and let $D$ be the target distribution. In what follows, all expectations are taken with respect to distribution $D$. For a function $g$, we also sometimes write $\bar{g}$ to denote its expectation:

$$\bar{g} = E[g] = E_{x \in D}[g(x)].$$

We will be interested in the partial derivatives of $f$, which turn out to be easily computed and their expectations easily approximated by sampling. We say that a gate A is an uncle of variable $x_i$ if A is an immediate input to a MUL gate fed by $x_i$, but A is not itself fed by $x_i$.

Lemma 7.1 Let $x_i$ be a relevant variable of $f$. Let $\{\mathrm{LIN}_{z_j,w_j}\}$ be the sequence of LIN gates fed by $x_i$, and let $\{g_j\}$ be the sequence of subformulas subsumed by uncles of $x_i$. Then

$$\frac{\partial f}{\partial x_i} = \prod_j w_j \cdot \prod_j g_j.$$

The same holds if $x_i$ is replaced by a subformula of $f$.
Proof: By induction on the depth of $f$. If the depth is zero, then $f = x_i$, and the lemma holds. (The "empty" product is 1.)

Otherwise, if the output gate of $f$ is a $\mathrm{LIN}_{z,w}$ gate, then $f = \mathrm{LIN}_{z,w}(f') = z + wf'$, for some subformula $f'$. (The notation $f'$ is potentially confusing since $f'$ sometimes denotes the derivative of $f$. However, here and throughout this chapter, $f'$ simply denotes a function that, in general, is unrelated to any derivative of $f$.) Thus, $\partial f/\partial x_i = w \cdot (\partial f'/\partial x_i)$ and the lemma holds by inductive hypothesis. If the output gate is a MUL gate, then $f = f'g$, for some subformulas $f'$ and $g$. Variable $x_i$ is relevant to only, say, $f'$. Thus, $g$ is a subformula subsumed by an uncle of $x_i$, and since $x_i$ is not relevant to $g$, we have $\partial f/\partial x_i = g \cdot (\partial f'/\partial x_i)$. Thus, the lemma holds again in this case by inductive hypothesis.

Regarding a subformula $g$ of $f$ as a meta-variable, it can be seen that this same argument holds if $x_i$ is replaced by subformula $g$. ■

Since each $g_j$ in this lemma is itself a real formula, $g_j$ has range in $[0,1]$. Also, $|w_j| \le 1$ for each $w_j$. Thus, it follows from Lemma 7.1 that $|\partial f/\partial x_i| \le 1$, and that the sign of $\partial f/\partial x_i$ is determined by the $w_j$'s. Thus, $\partial f/\partial x_i$ is either a nonpositive or nonnegative function.

Note that the formula $f$ could be "multiplied out" to give a polynomial over the variable set. This polynomial is linear in each variable $x_i$. This follows, for instance, from Lemma 7.1 since $x_i$ is not relevant to any of the functions $g_j$, and so $\partial^2 f/\partial x_i^2 = 0$. Thus, we can write

$$f = u + v x_i$$

for some functions $u$ and $v$ to which $x_i$ is not relevant. We call $u$ and $v$ the decomposition of $f$ in terms of $x_i$. In the same manner, $f$ can be decomposed in terms of any subformula $g$, and so can be written $f = u + vg$ for some functions $u$ and $v$ that do not contain any variable relevant to $g$.

Clearly, if $f = u + v x_i$, then

$$\frac{\partial f}{\partial x_i} = v = (f|x_i \leftarrow 1) - (f|x_i \leftarrow 0). \qquad (7.3)$$

Since, as noted above, $\partial f/\partial x_i$ is either nonpositive or nonnegative on all inputs,

$$|E[\partial f/\partial x_i]| = E[|\partial f/\partial x_i|] = E[|(f|x_i \leftarrow 1) - (f|x_i \leftarrow 0)|].$$

This latter expression is a natural measure of the influence of $x_i$, the degree to which $x_i$'s value affects the value of $f$.

Also, equation (7.3) implies that

$$E[\partial f/\partial x_i] = E[f|x_i \leftarrow 1] - E[f|x_i \leftarrow 0]. \qquad (7.4)$$

Note that the expressions on the right can be easily approximated by simply estimating the probability that a positive example is received when $x_i$ is hardwired to 0 or 1 (assuming $x_i$ is not sticky). Thus, the expected value of $\partial f/\partial x_i$ can be easily estimated as the difference of these estimates.
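Equation (7.4) can be checked on a small instance. The sketch below computes both hardwired expectations exactly by enumeration, assuming a product distribution in which bit i is 1 independently with probability `p[i]`; in the learning algorithm these expectations would instead be estimated from filtered samples. The helper names are hypothetical.

```python
from itertools import product

def f(x):
    # Formula (7.2): .25 * (.5 + .5 * x_1) * (1 - x_2), variables at indices 0, 1.
    return 0.25 * (0.5 + 0.5 * x[0]) * (1 - x[1])

def expect_hardwired(f, p, fixed):
    """Exact E[f | x_i <- b for each (i, b) in fixed] under a product distribution."""
    free = [i for i in range(len(p)) if i not in fixed]
    total = 0.0
    for bits in product((0, 1), repeat=len(free)):
        x = [0] * len(p)
        for i, b in fixed.items():
            x[i] = b
        weight = 1.0
        for i, b in zip(free, bits):
            x[i] = b
            weight *= p[i] if b else 1.0 - p[i]
        total += weight * f(x)
    return total

def expected_derivative(f, p, i):
    """Right-hand side of (7.4): E[f | x_i <- 1] - E[f | x_i <- 0]."""
    return expect_hardwired(f, p, {i: 1}) - expect_hardwired(f, p, {i: 0})

p = [0.5, 0.5]
B = [expected_derivative(f, p, i) for i in range(2)]
print(B)
```

For this formula the second entry is negative, reflecting the negative weight of the gate LIN_{1,-1} above x_2.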

In Stages II and III, we will also be interested in the expected value of the second partial derivatives; these values turn out to be quite useful to our algorithm. If $i \ne j$, then since $f$ is linear in every variable, $f$ can be written as

$$f = u_0 + u_1 x_i + u_2 x_j + u_3 x_i x_j$$

for some functions $u_0$, $u_1$, $u_2$ and $u_3$ to which $x_i$ and $x_j$ are irrelevant. Then it is easily verified that

$$\frac{\partial^2 f}{\partial x_i \partial x_j} = u_3 = (f|x_i \leftarrow 1, x_j \leftarrow 1) - (f|x_i \leftarrow 1, x_j \leftarrow 0) - (f|x_i \leftarrow 0, x_j \leftarrow 1) + (f|x_i \leftarrow 0, x_j \leftarrow 0). \qquad (7.5)$$

Since, as before, the expectation of each expression on the right can be estimated by sampling on filtered distributions, we can obtain a good estimate of $E[\partial^2 f/\partial x_i \partial x_j]$.

Note that Lemma 7.1 shows that either $\partial f/\partial x_i$ or its negation is a read-once real formula. Thus, in either case, the lemma implies that the second partial derivative $\partial^2 f/\partial x_i \partial x_j$ has many of the properties of the first derivative described above: in particular, its magnitude is bounded by 1, and it is nonnegative or nonpositive on all inputs.
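The mixed difference in (7.5) can be illustrated the same way. This sketch assumes a product distribution and uses an example formula chosen for this illustration (it is not taken from the text); the four hardwired expectations are computed exactly rather than estimated.

```python
from itertools import product

def f(x):
    # Hypothetical read-once real formula:
    # MUL(LIN_{.2,.6}(MUL(x_0, x_1)), LIN_{1,-.5}(x_2)).
    return (0.2 + 0.6 * x[0] * x[1]) * (1.0 - 0.5 * x[2])

def expect_hardwired(f, p, fixed):
    """Exact E[f | x_i <- b for each (i, b) in fixed] under a product distribution."""
    free = [i for i in range(len(p)) if i not in fixed]
    total = 0.0
    for bits in product((0, 1), repeat=len(free)):
        x = [0] * len(p)
        for i, b in fixed.items():
            x[i] = b
        weight = 1.0
        for i, b in zip(free, bits):
            x[i] = b
            weight *= p[i] if b else 1.0 - p[i]
        total += weight * f(x)
    return total

def expected_second_derivative(f, p, i, j):
    """Expectation of the right-hand side of (7.5)."""
    e = lambda bi, bj: expect_hardwired(f, p, {i: bi, j: bj})
    return e(1, 1) - e(1, 0) - e(0, 1) + e(0, 0)

p = [0.5, 0.5, 0.5]
A01 = expected_second_derivative(f, p, 0, 1)
A02 = expected_second_derivative(f, p, 0, 2)
print(A01, A02)
```

Multiplying out the example by hand gives 0.6(1 - 0.5 x_2) and -0.3 x_1 for the two mixed derivatives, which is what the enumeration averages.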

3-7.3  Stage  I:  Eliminating  sticky  and  uninfluential  variables 

As described above, in Stage I, sticky and uninfluential variables are eliminated. Our algorithm begins by estimating the probability $p_i$ that each variable $x_i$ is set to 1 under the target distribution. Applying Chernoff bounds (Lemma 2-3.6), we see that a polynomial-size sample suffices to obtain estimates $\hat{p}_i$ such that, with probability at least $1 - \delta/4$, every estimate $\hat{p}_i$ satisfies $|\hat{p}_i - p_i| \le \epsilon/12n$. If $\hat{p}_i < \epsilon/4n$ or $\hat{p}_i > 1 - \epsilon/4n$, then we say that $x_i$ is sticky. In this case, assuming the accuracy of our estimates, either $p_i < \epsilon/3n$ or $1 - p_i < \epsilon/3n$. Sticky variables are ignored in later stages of the algorithm.

If $x_i$ is not sticky, then $\epsilon/6n \le p_i \le 1 - \epsilon/6n$. In this case, a good estimate of $E[f|x_i \leftarrow b]$ can be obtained for $b \in \{0,1\}$. We require that these estimates have accuracy $\epsilon^8/(2(9n)^{10})$. (This high degree of accuracy is necessary for later stages of the algorithm.) With probability at least $1 - \delta/4$, such estimates can be obtained for all unsticky variables using a polynomial-size sample (again by applying Chernoff bounds).

These estimates can in turn be used to obtain estimates $\hat{B}_i$ of the expected value of $B_i = \partial f/\partial x_i$ using equation (7.4). We then have $|\hat{B}_i - \bar{B}_i| \le \epsilon^8/(9n)^{10}$. If $|\hat{B}_i| \ge \epsilon/4n$, we say that $x_i$ is influential; in this case, $|\bar{B}_i| \ge \epsilon/5n$. In later stages, uninfluential variables are ignored; for these variables, $|\bar{B}_i| \le \epsilon/3n$.
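The two Stage I filters amount to simple threshold checks on the estimates. A minimal sketch, assuming the estimates are already available as lists (the function name and the input format are hypothetical, and the sampling that produces the estimates is not shown):

```python
def classify_variables(p_hat, B_hat, eps):
    """Split variable indices into sticky, uninfluential, and kept.

    p_hat[i]: estimate of Pr[x_i = 1].
    B_hat[i]: estimate of E[df/dx_i], or None if x_i is sticky
              (the text estimates it only for unsticky variables).
    """
    n = len(p_hat)
    threshold = eps / (4 * n)
    sticky = {i for i in range(n)
              if p_hat[i] < threshold or p_hat[i] > 1 - threshold}
    # An unsticky variable is influential when |B_hat[i]| >= eps / 4n.
    uninfluential = {i for i in range(n)
                     if i not in sticky and abs(B_hat[i]) < threshold}
    kept = [i for i in range(n) if i not in sticky | uninfluential]
    return sticky, uninfluential, kept

# Made-up estimates for four variables; eps / 4n = 0.025 here.
sticky, uninfluential, kept = classify_variables(
    p_hat=[0.5, 0.01, 0.5, 0.99],
    B_hat=[0.3, None, 0.001, None],
    eps=0.4,
)
print(sticky, uninfluential, kept)
```

Only the variables in `kept` are passed on to the later stages.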

We show next that sticky and uninfluential variables can be ignored without introducing much error. Let $f'$ be the read-once real formula obtained from $f$ by replacing some variable $x_i$ by the constant $p_i$ (or, equivalently, since this is not technically a real formula, by $\mathrm{LIN}_{p_i,0}(x_i)$). Formula $f'$ is just the p-concept obtained by regarding $x_i$ as a hidden variable: if we ignore $x_i$'s value, then $f'$ describes the probability that an assignment to the other variables is labeled 1. Note that $\bar{g} = \bar{g}'$ for every subformula $g$ of $f$ and its corresponding subformula $g'$ of $f'$; this can be proved by an easy induction argument on the depth of $g$. Applying Lemma 7.1, this shows in particular that the influence of any other variable $x_j$ is the same in $f$ and $f'$.

Let $u$, $v$ be the decomposition of $f$ in terms of $x_i$ so that $f = u + v x_i$. Then $f' = u + v p_i$, and so

$$f - f' = v(x_i - p_i) = (\partial f/\partial x_i)(x_i - p_i).$$

Thus, by independence,

$$E[|f - f'|] = E[|\partial f/\partial x_i|] \cdot E[|x_i - p_i|] = |\bar{B}_i| \cdot 2p_i(1 - p_i).$$

Note that if $x_i$ is sticky or uninfluential, then this latter expression is at most $2\epsilon/3n$.
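The identity for the error incurred by hiding a variable can be verified numerically. The sketch below assumes a product distribution and an example formula chosen for illustration; both sides are computed exactly by enumeration, and all names are hypothetical.

```python
from itertools import product

def expectation(g, p):
    """Exact E[g(x)] when bit i is 1 independently with probability p[i]."""
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        weight = 1.0
        for xi, pi in zip(x, p):
            weight *= pi if xi else 1.0 - pi
        total += weight * g(x)
    return total

def f(x):
    # Hypothetical read-once real formula over x_0, x_1, x_2.
    return (0.2 + 0.6 * x[0] * x[1]) * (1.0 - 0.5 * x[2])

def hardwire(f, x, i, b):
    y = list(x)
    y[i] = b
    return f(y)

def hide(f, i, p_i):
    """f with x_i replaced by the constant p_i (f is linear in x_i)."""
    return lambda x: p_i * hardwire(f, x, i, 1) + (1.0 - p_i) * hardwire(f, x, i, 0)

def derivative(f, i):
    """df/dx_i as in (7.3)."""
    return lambda x: hardwire(f, x, i, 1) - hardwire(f, x, i, 0)

p = [0.5, 0.5, 0.3]
i = 2
f_prime = hide(f, i, p[i])
lhs = expectation(lambda x: abs(f(x) - f_prime(x)), p)
B_i = expectation(derivative(f, i), p)
rhs = abs(B_i) * 2 * p[i] * (1 - p[i])
print(lhs, rhs)
```

The two printed values agree up to rounding, as the independence argument predicts.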

Suppose $x_1, \ldots, x_s$ are the variables of $f$ which are either sticky or uninfluential. Let $f_0 = f$, and let $f_i$ be obtained from $f_{i-1}$ by replacing $x_i$ with the constant $p_i$ for $1 \le i \le s$, and let $f_I = f_s$. Then

$$E[|f - f_I|] \le \sum_{i=1}^{s} E[|f_i - f_{i-1}|] \le 2s\epsilon/3n \le 2\epsilon/3,$$

since, by the preceding argument, $E[|f_i - f_{i-1}|] \le 2\epsilon/3n$.

As in Section 3-6, we henceforth regard $f_I$ as the target formula. Sampling according to $f_I$ is achieved by simply ignoring the variables eliminated from $f$. In the later stages, it is shown how to find a hypothesis $\hat{f}$ such that $E[|\hat{f} - f_I|] \le \epsilon/3$, and thus $E[|\hat{f} - f|] \le \epsilon$.

Note that we can easily handle at this point the special case that all the variables of $f$ are eliminated. For this case, $f_I$ simply computes some constant function $p$. Since $\bar{f} = \bar{f}_I = p$, we can with high probability obtain an estimate $\hat{p}$ of $p$ so that $|\hat{p} - p| \le \epsilon/3$. Letting $\hat{f} = \hat{p}$ be our hypothesis, we have that $E[|\hat{f} - f_I|] = |\hat{p} - p|$, and thus $E[|\hat{f} - f|] \le \epsilon$ as desired.
3-7.4 Stage II: Inferring the formula's skeleton

Based on the results of Stage I, we can assume henceforth that none of the variables of $f$ are either sticky or uninfluential. Thus, in this section, $f$ actually refers to the formula $f_I$ of the previous section, and all of the variables discussed are assumed to be neither sticky nor uninfluential.

We show in this section how an approximation of the skeletal structure of $f$ can be obtained. Specifically, we show how a structure $\sigma$ can be inferred which is the skeleton of some formula $f_{II}$ for which $|f - f_{II}|$ is very small on all inputs. In Stage III, we will see how a formula very close to $f_{II}$ can be inferred from $\sigma$ and other statistical information.

Note that the functional composition of two LIN gates is a LIN gate, and that $\mathrm{LIN}_{0,1}$ is the identity function. Thus, without loss of generality, we assume the gates of $f$ occur in alternating layers of MUL and LIN gates, the output gate being a LIN gate, and every variable an input to a LIN gate. Thus, the topology or skeleton of $f$ is entirely determined by the tree structure of $f$ with respect to the MUL gates. Therefore, in reconstructing $f$'s skeleton, we will be quite interested in determining which MUL gates are fed by which other MUL gates, and in particular, which of two MUL gates occurs deeper in the formula.

Our algorithm uses two tests for determining which of two MUL gates is deeper. After describing the two tests, we show how a skeleton can be constructed from the results of these tests.

To simplify notation, we let $\Gamma_{ij} = \Gamma(x_i, x_j)$, and we write $g_{ij}$ to denote the subformula subsumed by $\Gamma_{ij}$.

For any three variables $x_i$, $x_j$ and $x_k$, we must have that two of the gates $\Gamma_{ij}$, $\Gamma_{ik}$ and $\Gamma_{jk}$ are actually the same, and that the remaining gate is the deepest of the three. For instance, if $\Gamma_{ij}$ is the deepest gate, then it feeds $\Gamma_{ik} = \Gamma_{jk}$. Our purpose, initially, is to determine which of these gates is deepest. This will be determined from the expected values of the first and second partial derivatives of $f$.

Recall that good estimates of $\bar{B}_i = E[\partial f/\partial x_i]$ were obtained in Stage I. Let $A_{ij} = \partial^2 f/\partial x_i \partial x_j$. From equation (7.5), it follows that good estimates $\hat{A}_{ij}$ of $\bar{A}_{ij}$ can also be obtained. Specifically, for each pair of variables, and each $b_1, b_2 \in \{0,1\}$, we estimate $E[f|x_i \leftarrow b_1, x_j \leftarrow b_2]$ to within accuracy $\epsilon^8/(4 \cdot (9n)^{10})$. Such accuracy can be achieved (for all unsticky variables) with probability at least $1 - \delta/4$ using a polynomial-size sample. Estimates of $\bar{A}_{ij}$ can then be derived using equation (7.5) since

$$E\left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right] = E[f|x_i \leftarrow 1, x_j \leftarrow 1] - E[f|x_i \leftarrow 1, x_j \leftarrow 0] - E[f|x_i \leftarrow 0, x_j \leftarrow 1] + E[f|x_i \leftarrow 0, x_j \leftarrow 0].$$

Figure 3: The decomposition of $f$ used in Lemma 7.2.

We assume henceforth that all of the estimates have the desired accuracy. Then estimates of $\bar{A}_{ij}$ derived in this manner are such that $|\hat{A}_{ij} - \bar{A}_{ij}| \le \epsilon^8/(9n)^{10}$. It will also be convenient to assume that each $\hat{B}_i$ and $\hat{A}_{ij}$ is in the range $[-1,1]$; since $\bar{B}_i$ and $\bar{A}_{ij}$ are known to be in this range, we make this assumption without loss of generality. ("Clamping" estimates in this range can only improve their accuracy.)

The  sign  test 

Our first test for determining which of two gates is deeper in $f$ is called the sign test, and it is based on our ability to determine the sign of certain partial derivatives as described below.

Specifically, for variables $x_i$ and $x_j$, we show below that it is possible to determine the sign of $E[\partial f/\partial g_{ij}]$, which we denote by $c_{ij}$. To see that this might be useful, note that if $\Gamma_{ik} = \Gamma_{jk}$ then certainly $g_{ik} = g_{jk}$, and so $c_{ik} = c_{jk}$. Thus, if for some triple $x_i$, $x_j$ and $x_k$ we find that $c_{ij} \ne c_{ik} = c_{jk}$, then we can conclude that $\Gamma_{ij}$ occurs deeper in $f$ than $\Gamma_{ik} = \Gamma_{jk}$. (Recall that exactly two of the gates $\Gamma_{ij}$, $\Gamma_{ik}$ and $\Gamma_{jk}$ must be equal to one another; since $c_{ij} \ne c_{ik} = c_{jk}$, this is the only possibility.) This is the essence of the sign test.

Lemma 7.2 Let $x_i$ and $x_j$ be distinct variables. Then

$$\mathrm{sign}(E[\partial f/\partial g_{ij}]) = \mathrm{sign}(\hat{B}_i \cdot \hat{B}_j \cdot \hat{A}_{ij}).$$

Proof: Since pairs of variables can only meet at MUL gates, we can write $g_{ij}$ as a product of its inputs, $g_{ij} = h_1 h_2$, where $h_1$ and $h_2$ are subformulas containing $x_i$ and $x_j$, respectively. We can decompose $h_1$ and $h_2$ in terms of $x_i$ and $x_j$, and so can write $h_1 = u_1 + v_1 x_i$ and $h_2 = u_2 + v_2 x_j$. We can also decompose $f$ in terms of $g_{ij}$: $f = u_0 + v_0 g_{ij}$. (This decomposition is summarized in Figure 3.)

Thus, we can easily compute that:

$$B_i = v_0 v_1 h_2$$
$$B_j = v_0 h_1 v_2$$
$$A_{ij} = v_0 v_1 v_2.$$

We have that $|v_0| \le 1$ by Lemma 7.1, and $h_1$ and $h_2$, being subformulas, are in the range $[0,1]$. Thus,

$$|\bar{A}_{ij}| = |\bar{v}_0 \bar{v}_1 \bar{v}_2| \ge |\bar{v}_0^2 \bar{v}_1 \bar{v}_2 \bar{h}_1 \bar{h}_2| = |\bar{B}_i||\bar{B}_j|.$$

This last expression is at least $(\epsilon/5n)^2$ since $x_i$ and $x_j$ are influential. Thus, $\mathrm{sign}(\hat{A}_{ij}) = \mathrm{sign}(\bar{A}_{ij})$. Likewise, the signs of $\bar{B}_i$ and $\bar{B}_j$ can be determined from their estimates.

From the expressions above, and since $h_1$ and $h_2$ are nonnegative, it is clear that $\mathrm{sign}(\bar{v}_0) = \mathrm{sign}(\bar{B}_i \cdot \bar{B}_j \cdot \bar{A}_{ij})$. Since $\bar{v}_0 = E[\partial f/\partial g_{ij}]$, this proves the lemma. ■

Thus, as described above, our algorithm computes $c_{ij} = \mathrm{sign}(E[\partial f/\partial g_{ij}])$ for each pair of variables $x_i$ and $x_j$. Then, for each triple $x_i$, $x_j$ and $x_k$, the algorithm tests whether $c_{ij} \ne c_{ik} = c_{jk}$. If it finds that this is the case, then the algorithm can correctly conclude that $\Gamma_{ik} = \Gamma_{jk}$, and therefore $\Gamma_{ij}$ occurs deeper than this gate.
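The sign test can be sketched as follows. The function assumes that estimates of the expected first and second partial derivatives are supplied in dictionaries (a hypothetical input format); the example values are the exact expectations for the made-up formula $f = (0.9 - 0.6 x_0 x_1) \cdot x_2$ under the uniform distribution, whose inner MUL gate sits below a LIN gate with negative weight.

```python
from itertools import combinations, permutations

def sign(v):
    return (v > 0) - (v < 0)

def pair(i, j):
    return frozenset((i, j))

def sign_test(B, A):
    """Return the signs c_ij and the directed edges found by the sign test.

    B[i]: estimate of E[df/dx_i].
    A[pair(i, j)]: estimate of E[d^2 f / dx_i dx_j].
    An edge (pair(i, j), pair(i, k)) records that the gate where x_i and
    x_j meet was found to be deeper than the gate where x_i and x_k meet.
    """
    n = len(B)
    c = {pair(i, j): sign(B[i] * B[j] * A[pair(i, j)])
         for i, j in combinations(range(n), 2)}
    edges = set()
    for i, j, k in permutations(range(n), 3):
        if c[pair(i, j)] != c[pair(i, k)] == c[pair(j, k)]:
            edges.add((pair(i, j), pair(i, k)))
    return c, edges

B = [-0.15, -0.15, 0.75]
A = {pair(0, 1): -0.3, pair(0, 2): -0.3, pair(1, 2): -0.3}
c, edges = sign_test(B, A)
print(c, edges)
```

Here the test reports that the gate joining x_0 and x_1 lies below the gate joining either of them with x_2, which matches the example formula.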

The results of all these sign tests are organized in a directed graph $G_s$. The nodes of $G_s$ are unordered pairs $\{i,j\}$, for $i \ne j$. For each ordered triple of distinct indices $i,j,k$, our algorithm tests if $c_{ij} \ne c_{ik} = c_{jk}$. If this is the case, an edge is directed in $G_s$ from $\{i,j\}$ to $\{i,k\}$. Thus, as argued above, an edge is added to $G_s$ in this fashion only if $\Gamma_{ij}$ is deeper in $f$ than $\Gamma_{ik}$. Moreover, by transitivity, a path from $\{i,j\}$ to $\{i',j'\}$ implies that $\Gamma_{ij}$ is deeper than $\Gamma_{i'j'}$.

Note that the sign test is a one-sided test in the sense that if $c_{ij} = c_{ik} = c_{jk}$ then nothing can be concluded about the relative depth of $\Gamma_{ij}$ and $\Gamma_{ik}$. However, our next lemmas give conditions under which the sign test and its graph $G_s$ are guaranteed to give such depth information.

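Lemma 7.3 Let $x_i$, $x_j$ and $x_k$ be distinct variables of $f$, and assume $\Gamma_{ij}$ is deeper in $f$ than $\Gamma_{ik} = \Gamma_{jk}$. Then $c_{ij} \ne c_{ik}$ if and only if $E[\partial g_{ik}/\partial g_{ij}] < 0$.

Proof: This follows immediately from the chain rule

$$\frac{\partial f}{\partial g_{ij}} = \frac{\partial f}{\partial g_{ik}} \cdot \frac{\partial g_{ik}}{\partial g_{ij}}$$

which implies by independence that $c_{ij} = c_{ik} \cdot \mathrm{sign}(E[\partial g_{ik}/\partial g_{ij}])$.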
Lemma 7.4 Let $x_i$, $x_j$ and $x_k$ be distinct variables of $f$, and assume $\Gamma_{ij}$ is deeper in $f$ than $\Gamma_{ik} = \Gamma_{jk}$. Assume also that the path in $f$ from $\Gamma_{ij}$ to $\Gamma_{ik}$ includes a gate $\mathrm{LIN}_{z,w}$ with $w < 0$. Then there exists a path in $G_s$ from $\{i,j\}$ to $\{i,k\}$.

Proof: Let $\{A_r\}_{r=1}^{t}$ be the sequence of LIN gates on the path from $\Gamma_{ij}$ to $\Gamma_{ik}$, and suppose that each $A_r$ computes the function $\mathrm{LIN}_{z_r,w_r}$.

By Lemma 7.1, the sign of $\partial g_{ik}/\partial g_{ij}$ is given by the sign of the product $\prod_r w_r$. Thus, if $\prod_r w_r < 0$, then $\partial g_{ik}/\partial g_{ij}$ is nonpositive, and so by Lemma 7.3, an edge is directed in $G_s$ from $\{i,j\}$ to $\{i,k\}$, and the lemma holds in this case.

Otherwise, $\prod_r w_r \ge 0$. Note that $\prod_r w_r$ is nonzero since if it were zero, then by Lemma 7.1, $\partial f/\partial x_i$ would be zero (since the path from $\Gamma_{ij}$ to $\Gamma_{ik}$ is a subpath of the path from $x_i$ to the output), contradicting the assumption that $x_i$ is influential.

Thus, $\prod_r w_r > 0$. Since we assume some $w_r < 0$, there is some $s$ such that $\prod_{r=1}^{s} w_r < 0$ and $\prod_{r=s+1}^{t} w_r < 0$. Since, by assumption, no LIN gate is the input to another LIN gate, there must be a MUL gate separating $A_s$ and $A_{s+1}$; since this MUL gate is fed by $x_i$, it can be written $\Gamma_{i\ell}$ for some variable $x_\ell$. Applying the first part of this argument twice (the case that $\prod_r w_r < 0$), it follows that there must exist an edge in $G_s$ from $\{i,j\}$ to $\{i,\ell\}$, and another edge from $\{i,\ell\}$ to $\{i,k\}$. This proves the lemma. ■

Thus, the sign test succeeds in determining which of two gates $\Gamma_{ij}$ and $\Gamma_{ik}$ occurs deeper in $f$ whenever the path from one gate to the other includes a $\mathrm{LIN}_{z,w}$ gate with $w < 0$. We describe next another test for handling all other situations.

The  product  test 

The second test is called the product test, and it has a flavor similar to that of the sign test. We will be interested in the quantities $d_{ijk} = \bar{B}_i \cdot \bar{A}_{jk}$. Note that $d_{ijk} = d_{ikj}$ always. We will see that if $\Gamma_{ij}$ occurs deeper than $\Gamma_{ik}$, then $d_{ijk} = d_{jik}$, but that if $\Gamma_{ik}$ occurs deeper than $\Gamma_{ij}$ then $d_{jik}$ is apt to differ significantly from $d_{ijk}$. Thus, the values of $d_{ijk}$ will give a second method for determining which of the three gates $\Gamma_{ij}$, $\Gamma_{ik}$ and $\Gamma_{jk}$ occurs deepest in $f$.

Lemma 7.5 Let $x_i$, $x_j$ and $x_k$ be distinct variables, and assume $\Gamma_{ij}$ occurs deeper in $f$ than $\Gamma_{ik} = \Gamma_{jk}$. Then $d_{ijk} = d_{jik}$.

Proof: We can decompose $f$ in terms of $g_{ik}$ as $f = u_0 + v_0 g_{ik}$, so that none of the variables relevant to $g_{ik}$ are relevant to the functions $u_0$ and $v_0$. Since $\Gamma_{ij}$ and $\Gamma_{ik}$ are MUL gates, we can express the function $g_{ik}$ in terms of the subformulas that it subsumes: $g_{ik} = y y_3$. Gate $\Gamma_{ij}$ occurs in one of these subformulas, say $y$, and $x_k$ occurs in the other, $y_3$. We can decompose $y$ and $y_3$ in terms of $g_{ij}$ and $x_k$, respectively, so that $y = u + v g_{ij}$, and $y_3 = u_3 + v_3 x_k$. Similarly, $\Gamma_{ij}$ is a MUL gate, so we can write $g_{ij} = y_1 y_2$ where $y_1$ and $y_2$ contain $x_i$ and $x_j$, respectively. Decomposing $y_1$ and $y_2$, we can write $y_1 = u_1 + v_1 x_i$ and $y_2 = u_2 + v_2 x_j$. (This decomposition of $f$ is summarized in Figure 4.)

Figure 4: The decomposition of $f$ used in Lemmas 7.5 and 7.6.

By a direct computation, we have that:

$$B_i = v v_0 v_1 y_2 y_3$$
$$B_j = v v_0 y_1 v_2 y_3$$
$$B_k = y v_0 v_3$$
$$A_{ij} = v v_0 v_1 v_2 y_3$$
$$A_{jk} = v v_0 y_1 v_2 v_3$$
$$A_{ik} = v v_0 v_1 y_2 v_3.$$

Thus, by independence,

$$d_{ijk} = \bar{v}^2 \bar{v}_0^2 \bar{v}_1 \bar{v}_2 \bar{v}_3 \bar{y}_1 \bar{y}_2 \bar{y}_3 = d_{jik}.$$

As defined in this proof, we call the function $u$ the offset of $g_{ik}$ with respect to $g_{ij}$, and we denote it $\mathrm{offset}(g_{ik}, g_{ij})$.

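Lemma 7.6 Let $x_i$, $x_j$ and $x_k$ be distinct variables, and assume $\Gamma_{ij}$ occurs deeper in $f$ than $\Gamma_{ik} = \Gamma_{jk}$. Then $|d_{ijk} - d_{kij}| \ge (\epsilon/5n)^3 \cdot |E[\mathrm{offset}(g_{ik}, g_{ij})]|$.

Proof: With the same set-up as in the preceding lemma, we have that

$$d_{kij} = \bar{v} \bar{y} \bar{v}_0^2 \bar{v}_1 \bar{v}_2 \bar{v}_3 \bar{y}_3.$$

Thus, since $y = u + v y_1 y_2$,

$$d_{kij} - d_{ijk} = \alpha \cdot (\bar{y} - \bar{v} \bar{y}_1 \bar{y}_2) = \alpha \bar{u},$$

where $\alpha = \bar{v} \bar{v}_0^2 \bar{v}_1 \bar{v}_2 \bar{v}_3 \bar{y}_3$. Note that $|\alpha| = |\bar{B}_k \cdot \bar{A}_{ij}/\bar{y}| \ge (\epsilon/5n)^3$ since $|\bar{y}| \le 1$, and since all of the variables are influential. (It was shown in the proof of Lemma 7.2 that $|\bar{A}_{ij}| \ge |\bar{B}_i| \cdot |\bar{B}_j|$.) This implies the lemma. ■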

We use $\hat{d}_{ijk} = \hat{B}_i \cdot \hat{A}_{jk}$ to estimate $d_{ijk}$. Then

$$|\hat{d}_{ijk} - d_{ijk}| \le |\hat{B}_i||\hat{A}_{jk} - \bar{A}_{jk}| + |\bar{A}_{jk}||\hat{B}_i - \bar{B}_i| \le 2\epsilon^8/(9n)^{10}$$

using the fact that $|\hat{B}_i|$ and $|\bar{A}_{jk}|$ are both bounded by 1. As before for the sign test, we maintain a graph $G_p$ with vertices $\{i,j\}$ for $i \ne j$. Edges are directed from $\{i,j\}$ to $\{i,k\}$ and also from $\{j,k\}$ to $\{i,k\}$ for all ordered triples $i,j,k$ which satisfy $|\hat{d}_{ijk} - \hat{d}_{jik}| \le 4\epsilon^8/(9n)^{10}$. Thus, by Lemma 7.5, if $\Gamma_{ij}$ is deeper than $\Gamma_{ik} = \Gamma_{jk}$, then $d_{ijk} = d_{jik}$ and so edges are directed from $\{i,j\}$ to $\{i,k\}$ and from $\{j,k\}$ to $\{i,k\}$. However, the product test is one sided (though in a different way than the sign test): it may happen that such edges are added even if $\Gamma_{ij}$ is above $\Gamma_{ik}$. Nevertheless, by Lemma 7.6, this can only be the case if $|E[\mathrm{offset}(g_{ij}, g_{ik})]|$ is quite small.
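The product test can be sketched in the same style as the sign test. The function below takes estimates in dictionaries and a tolerance `tol` standing in for the threshold $4\epsilon^8/(9n)^{10}$; the input format and names are assumptions of this sketch. The example values are the exact expectations for the made-up formula $f = (0.2 + 0.6 x_0 x_1)(1 - 0.5 x_2)$ under the uniform distribution, where the inner gate has offset 0.2.

```python
from itertools import permutations

def pair(i, j):
    return frozenset((i, j))

def product_test(B, A, tol):
    """Return the edge set of G_p as (source, target) pairs of vertices.

    B[i]: estimate of E[df/dx_i].
    A[pair(j, k)]: estimate of E[d^2 f / dx_j dx_k].
    tol: tolerance on |d_ijk - d_jik|, where d_ijk = B[i] * A[pair(j, k)].
    """
    n = len(B)
    d = lambda i, j, k: B[i] * A[pair(j, k)]
    edges = set()
    for i, j, k in permutations(range(n), 3):
        if abs(d(i, j, k) - d(j, i, k)) <= tol:
            edges.add((pair(i, j), pair(i, k)))
            edges.add((pair(j, k), pair(i, k)))
    return edges

B = [0.225, 0.225, -0.175]
A = {pair(0, 1): 0.45, pair(0, 2): -0.15, pair(1, 2): -0.15}
edges = product_test(B, A, tol=1e-9)
print(sorted((sorted(u), sorted(v)) for u, v in edges))
```

In this example the vertex {0,1} has edges to both other vertices and receives none, while {0,2} and {1,2}, which name the same gate, point at each other.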

Constructing the skeleton from $G_s$ and $G_p$

Finally, we are ready to show how an "approximate" skeleton of $f$ can be computed from the graphs $G_s$ and $G_p$. First, to reiterate what was pointed out above, an edge in $G_s$ from $\{i,j\}$ to $\{i,k\}$ indicates that $\Gamma_{ij}$ must be deeper than $\Gamma_{ik}$; however, an edge in $G_p$ between these vertices indicates only that $\Gamma_{ij}$ may be deeper than or equal to $\Gamma_{ik}$ (although it might not be). On the other hand, if $\Gamma_{ij}$ is deeper than $\Gamma_{ik}$, or if $\Gamma_{ij} = \Gamma_{ik}$, then there must be an edge in $G_p$ from $\{i,j\}$ to $\{i,k\}$, but possibly not in $G_s$.

We can combine $G_p$ and $G_s$ into a single graph $G_c$ on the same vertex set: From the comments above, it follows that the edge set of $G_s$ is a subset of the edge set of $G_p$. The graph $G_c$ is obtained from $G_p$ by deleting all edges from $\{i,j\}$ to $\{i,k\}$ which "contradict" $G_s$, i.e., for which there exists a path in $G_s$ from $\{i,k\}$ to $\{i,j\}$. Such edges can be removed with impunity since the reverse path in $G_s$ indicates that $\Gamma_{ik}$ must be deeper than $\Gamma_{ij}$.

It is easily seen that if $\Gamma_{ij}$ is deeper than $\Gamma_{ik}$, or if $\Gamma_{ij} = \Gamma_{ik}$, then $G_c$ will contain an edge from $\{i,j\}$ to $\{i,k\}$ since the corresponding edge in $G_p$ will not be deleted.

Thus, if there exists a path in $G_c$ from $\{i,j\}$ to $\{k,\ell\}$, but no path in the reverse direction, then we can conclude that $\Gamma_{ij}$ is deeper in $f$ than $\Gamma_{k\ell}$. The problem arises when there exist paths connecting these vertices in both directions, that is, when $\{i,j\}$ and $\{k,\ell\}$ are in the same strongly connected component of $G_c$. We therefore need some way of dealing with these strongly connected components.

We say two vertices of $G_c$ are equivalent (with respect to $G_c$) if they are in the same strongly connected component.

We say that a LIN gate of $f$ is trapped if it is immediately fed by some MUL gate $\Gamma_{ij}$, and it immediately feeds another MUL gate $\Gamma_{k\ell}$ for some indices $i,j,k,\ell$ which are such that $\{i,j\}$ and $\{k,\ell\}$ are equivalent. In other words, the LIN gate is trapped if it is "surrounded" above and below by gates whose indices are in the same strongly connected component of $G_c$. We will see that if some gate $\mathrm{LIN}_{z,w}$ is trapped, then its $z$-value must be quite small, and so we will be able to approximate such gates by a $\mathrm{LIN}_{0,w}$ gate. As will be seen, the strongly connected components of $G_c$ correspond to connected regions of $f$ (where we view $f$ as a graph whose vertices are the gates, and whose edges are the "wires" connecting the gates); thus we will be able to approximate these connected regions by simple MUL gates.

Below, we let $\theta = (8 \cdot 5^4/9^{10})\,\epsilon^4/n^6$.

Lemma 7.7 Suppose A is a trapped $\mathrm{LIN}_{z,w}$ gate of $f$. Then $w \ge 0$ and $z \le \theta$.

Proof: Since A is trapped, it is immediately fed by some MUL gate $\Gamma_{ij}$, and it immediately feeds MUL gate $\Gamma_{k\ell}$ where $\{i,j\}$ and $\{k,\ell\}$ are equivalent. Let $T$ be the set of vertices $\{i',j'\}$ for which $\Gamma_{i'j'}$ feeds $\Gamma_{ij}$, or $\Gamma_{i'j'} = \Gamma_{ij}$. Then $\{k,\ell\} \notin T$, since $\Gamma_{ij}$ feeds $\Gamma_{k\ell}$. Since there exists a path from $\{k,\ell\}$ to $\{i,j\}$, there must be an edge directed from one vertex $\{i',k'\} \notin T$ to another $\{i',j'\} \in T$. Since $\Gamma_{i'j'}$ feeds or is equal to $\Gamma_{ij}$, the path in $f$ from $\Gamma_{i'j'}$ to $\Gamma_{i'k'}$ must include gate A.

We are interested in examining this path more closely. Let $h_0, h_1, \ldots, h_r$ be the sequence of subformulas subsumed by the MUL gates along this path. Thus, $h_0 = g_{i'j'}$ and $h_r = g_{i'k'}$. Further, since each consecutive pair of MUL gates is separated by some gate $\mathrm{LIN}_{z_t,w_t}$, we can write $h_t = y_t(z_t + w_t h_{t-1})$ where $y_t$ is a subformula subsumed by an uncle of $x_{i'}$, for $1 \le t \le r$.

Note that no $w_t < 0$. If it were, then by Lemma 7.4, there would be a path from $\{i',j'\}$ to $\{i',k'\}$ in $G_s$, and so, by construction of $G_c$, there could not be an edge in $G_c$ from $\{i',k'\}$ to $\{i',j'\}$. In particular, since A is on the path from $\Gamma_{i'j'}$ to $\Gamma_{i'k'}$, this implies $w \ge 0$.

We can decompose the subformula $z_t + w_t h_{t-1}$ in terms of $h_0$ so that $z_t + w_t h_{t-1} = u_t + v_t h_0$. Thus, $h_t = y_t(u_t + v_t h_0)$. Note that $u_t = \mathrm{offset}(h_t, h_0)$. Expanding, this implies for $t \ge 2$ that

$$h_t = y_t(z_t + w_t h_{t-1})$$
$$= y_t(z_t + w_t y_{t-1}(u_{t-1} + v_{t-1} h_0))$$
$$= y_t(z_t + w_t y_{t-1} u_{t-1} + w_t y_{t-1} v_{t-1} h_0)$$
$$= y_t(u_t + v_t h_0).$$

Thus,

$$u_t = z_t + w_t y_{t-1} u_{t-1}$$

for $t \ge 2$. For $t = 1$, we have that $h_1 = y_1(u_1 + v_1 h_0) = y_1(z_1 + w_1 h_0)$ so $u_1 = z_1$. By a straightforward induction on $r$, it follows that

$$\mathrm{offset}(h_r, h_0) = u_r = \sum_{s=1}^{r} \left( z_s \cdot \prod_{t=s+1}^{r} w_t y_{t-1} \right).$$

Since each $w_t \ge 0$, this sum is at least $z_s \prod_{t=s+1}^{r} w_t y_{t-1}$ for any $s$. Note that, by Lemma 7.1, $\prod_{t=s+1}^{r} w_t \bar{y}_{t-1} \ge |E[\partial f/\partial x_{i'}]|$ since the path in $f$ from $x_{i'}$ to the output includes the path from $\Gamma_{i'j'}$ to $\Gamma_{i'k'}$. Thus, by Lemma 7.6 and our criterion for adding edges to $G_p$, and since $x_{i'}$ is influential,

$$z_s \cdot \frac{\epsilon}{5n} \le |E[\mathrm{offset}(h_r, h_0)]| \le \left(\frac{5n}{\epsilon}\right)^3 \cdot \frac{8\epsilon^8}{(9n)^{10}}.$$

Since $z = z_s$ for some $s$, this proves the lemma. ■

Let $f' = f_{II}$ be the real formula obtained from $f$ by replacing all trapped gates $\mathrm{LIN}_{z,w}$ by $\mathrm{LIN}_{0,w}$. Then $f'$ is a good approximation of $f$:

Lemma 7.8 Let $f$ and $f' = f_{II}$ be as above. Then $|f(x) - f'(x)| \le 2n\theta$ for all inputs $x$.


Proof:  We  show  by  induction  on  the  height  of  /  that  \  f  —  f'\  <  s0  for  all  inputs,  where  s  is  the 
number  of  LIN  gates  occurring  in  /.  We  prove  the  lemma  in  a  slightly  stronger  form  for  any  /' 
obtained  from  /  by  replacing  each  linzu/  gate  by  a  lin2<u/  gate  for  any  z'  satisfying  | z  —  z' |  <  0; 
also,  z'  must  be  such  that  z'  and  w  +  z'  are  in  the  range  [0, 1]  so  that  f  is  a  real  formula.  Note 
that  the  f  of  the  hypothesis  satisfies  these  conditions  by  Lemma  7.7. 

If  /  =  x,-,  then  /'  =  x,  =  /,  and  the  lemma  holds. 

If $f = \mathrm{LIN}_{z,w}(y)$, then $f' = \mathrm{LIN}_{z',w}(y')$ for some $z'$ and $y'$ satisfying $|z - z'| \le \theta$ and, by
inductive hypothesis, $|y - y'| \le (s-1)\theta$. Then on all inputs,

$$
\begin{aligned}
|f - f'| &= |z - z' + w(y - y')| \\
         &\le |z - z'| + |w|\,|y - y'| \\
         &\le s\theta
\end{aligned}
$$

since $|w| \le 1$.

If $f = \mathrm{MUL}(y_1, y_2) = y_1 y_2$, where $s_1$ and $s_2$ LIN gates occur in $y_1$ and $y_2$, respectively, then
$s_1 + s_2 = s$, and $f' = y_1' y_2'$ where $|y_i - y_i'| \le s_i\theta$ for $i = 1,2$. Then on all inputs,

$$
\begin{aligned}
|f - f'| &= |y_1 y_2 - y_1' y_2'| \\
         &= |y_1 y_2 - y_1 y_2' + y_1 y_2' - y_1' y_2'| \\
         &\le |y_1|\,|y_2 - y_2'| + |y_2'|\,|y_1 - y_1'| \\
         &\le s_2\theta + s_1\theta = s\theta.
\end{aligned}
$$

This completes the induction.

Since $f$ has at most $2n$ LIN gates, this also proves the lemma. ■

Input: the graph $G_\epsilon$

Output: a skeleton that approximates $f$

Procedure:

1  $G \leftarrow G_\epsilon$
2  $F \leftarrow \{\mathrm{LIN}(x_i) : 1 \le i \le n\}$
3  repeat while $G$ is not empty
4      find a strongly connected component $H$ of $G$ with no incoming edges
5      $E \leftarrow \{\sigma \in F : \{i,j\} \in H \text{ and } x_i \in \mathrm{rel}(\sigma)\}$
6      $F \leftarrow (F - E) \cup \{\mathrm{LIN}(\mathrm{MUL}_{\sigma \in E}\, \sigma)\}$
7      $G \leftarrow G - H$
8  end
9  output the only member of $F$

Figure 5: An algorithm for inferring a good skeleton from $G_\epsilon$.
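The bound of Lemma 7.8 can be checked numerically on small formulas. The following is a minimal sketch, not part of the original algorithm: the nested-tuple encoding of formulas and the rule "zero every offset $z \le \theta$" are assumptions made purely for illustration (in the text, which gates are trapped is decided by the graph construction of Stage II, and Lemma 7.7 supplies the bound on their offsets).

```python
from itertools import product

# Hypothetical encoding: ("VAR", i), ("LIN", z, w, child), ("MUL", left, right).

def evaluate(node, x):
    """Evaluate a real formula on a 0/1 input tuple x."""
    kind = node[0]
    if kind == "VAR":
        return x[node[1]]
    if kind == "LIN":
        _, z, w, child = node
        return z + w * evaluate(child, x)
    _, left, right = node
    return evaluate(left, x) * evaluate(right, x)

def zero_small_offsets(node, theta):
    """Return (new formula, number of LIN gates); offsets z <= theta become 0.

    Stand-in for replacing trapped LIN_{z,w} gates by LIN_{0,w}.
    """
    kind = node[0]
    if kind == "VAR":
        return node, 0
    if kind == "LIN":
        _, z, w, child = node
        new_child, s = zero_small_offsets(child, theta)
        return ("LIN", 0.0 if z <= theta else z, w, new_child), s + 1
    _, left, right = node
    new_left, s1 = zero_small_offsets(left, theta)
    new_right, s2 = zero_small_offsets(right, theta)
    return ("MUL", new_left, new_right), s1 + s2

def max_gap(f, f_prime, n):
    """Largest |f(x) - f'(x)| over all 2^n inputs (only feasible for small n)."""
    return max(abs(evaluate(f, x) - evaluate(f_prime, x))
               for x in product((0, 1), repeat=n))

theta = 0.01
f = ("LIN", 0.005, 0.9,
     ("MUL", ("LIN", 0.01, 0.8, ("VAR", 0)), ("LIN", 0.3, 0.5, ("VAR", 1))))
f_prime, s = zero_small_offsets(f, theta)
gap = max_gap(f, f_prime, 2)  # the induction in the proof gives gap <= s * theta
```

Every changed offset moves by at most $\theta$, so the example satisfies the hypothesis of the stronger form proved above.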

Finally, we give an algorithm that infers the skeleton of a formula $h$ that equals $f'$. (That
is, $h$ may differ syntactically from $f'$, but it is functionally equivalent.) The algorithm is shown
in Figure 5. The algorithm maintains a family of skeletons $F$. On each iteration of the loop,
several of these skeletons are combined into one. We will show that only one skeleton remains in
$F$ upon termination. Also, although the MUL operator is technically binary, the multiple-input
product used at line 6 can be replaced in the obvious manner by a tree of MUL gates. The set
$\mathrm{rel}(\sigma)$ for skeleton or subformula $\sigma$ is the set of variables relevant to $\sigma$. Finally, $G - H$ at line 7
denotes the graph obtained from $G$ by deleting all vertices in $H$, and any edges incident to
these deleted vertices.

We  say  a  condition  holds  at  all  times  if  it  holds  between  each  iteration  of  the  main  loop. 
The  algorithm  clearly  halts  since  some  vertex  of  G  is  removed  on  each  iteration. 
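The procedure of Figure 5 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the thesis's implementation: vertices of the graph are assumed to be given as `frozenset({i, j})` pairs, edges as ordered pairs of vertices, and skeletons as nested tuples; the names `reachable`, `source_component` and `infer_skeleton` are hypothetical. The source component is found by plain reachability, which is polynomial but not the fastest method.

```python
def reachable(start, adjacency):
    """All vertices reachable from start (including start itself)."""
    seen, todo = {start}, [start]
    while todo:
        for nxt in adjacency[todo.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def source_component(vertices, edges):
    """A strongly connected component with no incoming edges (line 4)."""
    succ = {v: set() for v in vertices}
    pred = {v: set() for v in vertices}
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    for v in vertices:
        comp = reachable(v, succ) & reachable(v, pred)
        if all(u in comp for w in comp for u in pred[w]):
            return comp
    return None  # only reached when the graph is empty

def infer_skeleton(n, vertices, edges):
    """Lines 1-9 of Figure 5; F holds (skeleton, relevant-variable set) pairs."""
    vertices = set(vertices)
    edges = set(edges)
    F = [(("LIN", ("VAR", i)), {i}) for i in range(1, n + 1)]
    while vertices:
        H = source_component(vertices, edges)
        touched = set().union(*H)  # indices i appearing in some pair of H
        E = [(s, rel) for s, rel in F if rel & touched]
        rest = [(s, rel) for s, rel in F if not (rel & touched)]
        merged = ("LIN", ("MUL",) + tuple(s for s, _ in E))
        F = rest + [(merged, set().union(*(rel for _, rel in E)))]
        vertices -= H
        edges = {(u, v) for u, v in edges if u in vertices and v in vertices}
    assert len(F) == 1  # Lemma 7.11, assuming a well-formed input graph
    return F[0][0]
```

For a three-variable formula in which $x_1$ and $x_2$ meet at a deeper MUL gate than the one joining them with $x_3$, the pair $\{1,2\}$ forms the first source component, so the skeletons for $x_1$ and $x_2$ are merged before $x_3$ is attached.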

Lemma 7.9 At all times, if $\sigma_1$ and $\sigma_2$ are distinct members of $F$, then $\mathrm{rel}(\sigma_1) \cap \mathrm{rel}(\sigma_2) = \emptyset$.
Also, $\bigcup_{\sigma \in F} \mathrm{rel}(\sigma) = \{x_1, \ldots, x_n\}$.

Proof: Using the fact that $\mathrm{rel}(\mathrm{LIN}(\mathrm{MUL}_{\sigma \in E}\, \sigma)) = \bigcup_{\sigma \in E} \mathrm{rel}(\sigma)$, this follows by an easy induction
on the number of iterations of the main loop. ■

Lemma 7.10 At all times, if $\{i,j\}$ is not a vertex of $G$ then $\{x_i, x_j\} \subseteq \mathrm{rel}(\sigma)$ for some $\sigma \in F$.

Proof: Note that $\{i,j\}$ is removed from $G$ at line 7 only if $x_i$ and $x_j$ are relevant to skeletons
in $E$. ■

Lemma 7.11 Upon termination of the main loop, $|F| = 1$.

Proof: If, upon termination, $F$ contained two skeletons $\sigma_1$ and $\sigma_2$, then each must have
distinct relevant variables $x_i$ and $x_j$, respectively, by Lemma 7.9. By Lemma 7.10, this implies
$\{i,j\} \in G$, contradicting the fact that $G$ is empty. ■

Lemma 7.12 At all times, if $\sigma \in F$ then $\sigma$ is the skeleton of some formula which equals a
subformula of $f'$.

Proof: By induction on the number of iterations of the main loop. Initially, the lemma holds
trivially.

Consider the state of the algorithm immediately before $G$ is modified at line 7. Let $E =
\{\sigma_1, \ldots, \sigma_r\}$ be as in the algorithm, and let $\sigma = \mathrm{LIN}(\mathrm{MUL}_{t=1}^{r}\, \sigma_t)$. Let $\{k,\ell\} \in H$ be such that
$\Gamma_{k\ell}$ is a gate of minimal depth in the set $\{\Gamma_{ij} : \{i,j\} \in H\}$. (That is, $\Gamma_{k\ell}$ does not feed any
gate in this set.)

Claim: $\mathrm{rel}(\sigma) = \mathrm{rel}(g_{k\ell})$.

Proof of claim: Note first that, by definition of $E$,

$$\mathrm{rel}(\sigma) = \{x_i : \{i,j\} \in H\}.$$

Thus, if $x_i \in \mathrm{rel}(\sigma)$ then $\{i,j\} \in H$ for some $j$. Let $T$ be the set

$$\{\{i',j'\} \in G : \Gamma_{i'j'} \text{ feeds } \Gamma_{k\ell} \text{ or } \Gamma_{i'j'} = \Gamma_{k\ell}\}.$$

We will show that $\{i,j\} \in T$; this implies that $x_i$ is in the subformula subsumed by $\Gamma_{k\ell}$, and
thus $x_i \in \mathrm{rel}(g_{k\ell})$.

If $\{i,j\} \notin T$, then since $\{k,\ell\} \in T$ and since $\{i,j\}$ and $\{k,\ell\}$ are equivalent, there must be
an edge in $H$ from some node $\{i',j'\} \notin T$ to some other node $\{i',k'\} \in T$. Then $\Gamma_{i'k'}$ must feed
$\Gamma_{i'j'}$, and $\Gamma_{k\ell}$ must be on the path in $f$ from $\Gamma_{i'k'}$ to $\Gamma_{i'j'}$. Thus, $\Gamma_{i'k'}$ feeds (or is equal to) $\Gamma_{k\ell}$,
which in turn feeds $\Gamma_{i'j'}$. Thus, there exist paths in $G_\epsilon$ from $\{i',k'\}$ to $\{k,\ell\}$, and from $\{k,\ell\}$
to $\{i',j'\}$. Since an edge is directed from $\{i',j'\}$ to $\{i',k'\}$, this implies that $\{i',j'\}$ is in $H$.

However, this contradicts the definition of $\{k,\ell\}$ since $\{k,\ell\}$ is deeper than $\{i',j'\}$.

Thus, $\mathrm{rel}(\sigma) \subseteq \mathrm{rel}(g_{k\ell})$.

Conversely, suppose $x_i \in \mathrm{rel}(g_{k\ell})$. Then $\Gamma_{i\ell}$ feeds or is equal to $\Gamma_{k\ell}$. Thus, there is a path
in $G_\epsilon$ from $\{i,\ell\}$ to $\{k,\ell\}$. Since $H$ is assumed to have no incoming edges, this implies that
either $\{i,\ell\} \in H$ or $\{i,\ell\} \notin G$. If $\{i,\ell\} \notin G$, then by Lemma 7.10, $x_i \in \mathrm{rel}(\sigma)$ since $x_\ell \in \mathrm{rel}(\sigma)$.
If $\{i,\ell\} \in H$, then $x_i \in \mathrm{rel}(\sigma)$ by construction. Thus, $\mathrm{rel}(\sigma) = \mathrm{rel}(g_{k\ell})$, proving the claim.

The gate $\Gamma_{k\ell}$ in $f'$ immediately feeds some LIN gate. Let $h$ be the subformula of $f'$ subsumed
by this LIN gate. We will show that $\sigma$ is the skeleton of a formula that equals $h$.

By inductive hypothesis, each $\sigma_t$ is the skeleton of a formula equal to some subformula $h_t$
of $f'$. The above claim implies that each $h_t$ is in fact a subformula of $h$. Thus, $h$ can be written
as $h = g(h_1, \ldots, h_r)$ for some read-once real formula $g$. That is, $g$ is what remains of $h$ when
each $h_t$ is replaced by a meta-variable.

Consider a gate $\Gamma_{ij}$ in $h$ that does not belong to any of the subformulas $h_t$ (and thus that
remains in $g$). Since $\Gamma_{ij}$ is in $h$, $x_i$ and $x_j$ are in $\mathrm{rel}(\sigma)$, from the above claim. However, since $\Gamma_{ij}$
is not in any of the subformulas $h_t$, $x_i$ and $x_j$ must be in different subformulas; thus, $\{i,j\} \in G$
by Lemma 7.10. Since $\Gamma_{ij}$ feeds or is equal to $\Gamma_{k\ell}$, there exists a path in $G_\epsilon$ from $\{i,j\}$ to $\{k,\ell\}$.
Thus, $\{i,j\} \in H$.

This implies that every internal LIN gate of $g$ is trapped since $\{i,j\} \in H$ for every MUL gate
$\Gamma_{ij}$ of $g$. Thus, if $\mathrm{LIN}_{z,w}$ is an internal gate of $g$ then $z = 0$ by construction of $f'$, and the LIN
gate is simply multiplying its input by a constant. Since

$$\mathrm{MUL}(\mathrm{LIN}_{0,w_1}(y_1), \mathrm{LIN}_{0,w_2}(y_2)) = \mathrm{LIN}_{0,w_1 w_2}(\mathrm{MUL}(y_1, y_2)),$$

this implies that all of the LIN gates of $g$ can be "pulled to the top." In other words, we can
write

$$h = \mathrm{LIN}_{z,w}(\mathrm{MUL}_{t=1}^{r}\, h_t)$$

for an appropriate choice of $z$ and $w$. This completes the lemma. ■

Combining Lemmas 7.9, 7.11, and 7.12, it follows immediately that the algorithm of Figure 5
outputs a skeleton of a formula equal to $f'$. The algorithm clearly runs in polynomial time.




3-7.5  Stage  III:  Inferring  the  skeleton’s  z,w  values 

In the final stage, our algorithm "fills in" the missing $z, w$ values of the skeleton inferred in
Stage II. Recall that this skeleton $\sigma$ was for a function $f_{II}$ which closely approximates the
target $f$. We would like to view $f_{II}$ as the target since we have its skeleton to work with. The
problem is that we have no way of sampling according to $f_{II}$; we can only sample using the
target we have been provided with, and $f_{II}$ may differ slightly from the target.

However, it turns out that this is not a problem since $f$ and $f_{II}$ are so close to one another.
We showed in Lemma 7.8 that $|f(x) - f_{II}(x)| \le 2n\theta$ on all inputs $x$. Thus, for any distribution
on the inputs (such as the filtered distributions used for estimating the expected values of the
partial derivatives), the expected value of $f$ is within $2n\theta$ of the expected value of $f_{II}$. Thus,
we can view $f_{II}$ henceforth as the target concept $f$, taking into account the fact that all of our
statistical estimates are off by an additional error of at most $2n\theta$. In particular, for all $i, j$,

$$|\bar{B}_i - \hat{B}_i| \le \epsilon^8/(9n)^{10} + 4n\theta \le a\epsilon^4/n^5$$

$$|\bar{A}_{ij} - \hat{A}_{ij}| \le \epsilon^8/(9n)^{10} + 8n\theta \le a\epsilon^4/n^5$$

where

$$a = (64 \cdot 5^4 + 1)/9^{10}.$$

Note that $B_i$ now refers to $\partial f/\partial x_i$ for $f = f_{II}$ but the estimates $\hat{B}_i$ are unchanged; likewise for
$A_{ij}$. Thus, for this new target formula, influential variables are such that $\bar{B}_i \ge \epsilon/4n - a\epsilon^4/n^5 \ge
\epsilon/5n$.

Our algorithm uses these values to estimate the $z, w$ values of the LIN gates of $f$. The
algorithm computes the $z, w$ values from the bottom up: the algorithm visits each gate of the
skeleton, starting with those nearest the input level. No gate is visited until all those that feed
it have been visited. When some gate A is visited which subsumes subformula $g$, our algorithm
computes a formula that approximates not the function $g$ itself, but rather the function $g/\bar{g}$.
(Apparently, $g/\bar{g}$ is easier to approximate than $g$.) This is done for all gates except the
output-level gate where an approximation of the function $f$ itself is computed.

Although not the case for the skeleton computed in Stage II, we nevertheless assume without
loss of generality that the gates of $\sigma$ occur in alternating layers of MUL/LIN gates, with a LIN
gate at the output, and every variable an input to a LIN gate.

Suppose that our algorithm is visiting some gate of $\sigma$, and assume that the corresponding
gate A of $f$ subsumes subformula $g$. There are five cases to consider:

1. A is a MUL gate, and thus A is immediately fed by a LIN gate, and immediately feeds a
LIN gate;

2. A is a LIN gate immediately fed by a variable, and A is not the output gate;

3. A is a LIN gate immediately fed by a MUL gate, and A is not the output gate;

4. A is a LIN gate immediately fed by a variable, and A is the output gate; and

5. A is a LIN gate immediately fed by a MUL gate, and A is the output gate.


When A is not the output gate (i.e., cases 1, 2 and 3), we show inductively how to find an
approximation $\hat{h}$ of $h = g/\bar{g}$ so that $E[|h - \hat{h}|] \le s\tau/\bar{g}$ where $s$ is the number of gates in $g$,
and

$$\tau = \epsilon/12n^2.$$

We will find it useful in case 1 to also prove as part of our inductive hypothesis that $E[|h - \hat{h}|] \le
5$. When A is the output gate (cases 4 and 5) we show how to find an approximation $\hat{f}$ of $g = f$,
so that $E[|f - \hat{f}|] \le \epsilon/4$.


Case  1 

In this case, A is a MUL gate. Thus, we can express $g$ as a product of its inputs, $g = g_1 g_2$. We
assume inductively that, for $i = 1,2$, approximations $\hat{h}_i$ are available for $h_i = g_i/\bar{g}_i$, with the
property that

$$E[|h_i - \hat{h}_i|] \le s_i\tau/\bar{g}_i.$$

Here, $s_i$ is the number of gates in $g_i$, so that $s = s_1 + s_2 + 1$ is the number of gates in $g$.

We use $\hat{h}_1$ and $\hat{h}_2$ to approximate $h = g/\bar{g}$. Clearly, by independence, $h = g/\bar{g} = g_1 g_2/\bar{g}_1\bar{g}_2 =
h_1 h_2$, so we let $\hat{h} = \hat{h}_1\hat{h}_2$ in this case.

Then the average error of $\hat{h}$ can be computed as follows:

$$
\begin{aligned}
E[|h - \hat{h}|] &= E[|h_1 h_2 - \hat{h}_1\hat{h}_2|] \\
  &= E[|h_1(h_2 - \hat{h}_2) + h_2(h_1 - \hat{h}_1) - (h_1 - \hat{h}_1)(h_2 - \hat{h}_2)|] \\
  &\le E[|h_1||h_2 - \hat{h}_2| + |h_2||h_1 - \hat{h}_1| + |h_1 - \hat{h}_1||h_2 - \hat{h}_2|].
\end{aligned}
$$


Note that $\bar{h}_1 = \bar{h}_2 = 1$, and that $h_1$ and $h_2$ are nonnegative functions. Thus, by independence,
and by our inductive assumptions about the accuracy of $\hat{h}_1$ and $\hat{h}_2$, it follows that $E[|h - \hat{h}|]$
is bounded by

$$\frac{s_1\tau}{\bar{g}_1} + \frac{s_2\tau}{\bar{g}_2} + \frac{s_1 s_2\tau^2}{\bar{g}_1\bar{g}_2}. \qquad (7.6)$$

Since $\bar{g} = \bar{g}_1\bar{g}_2 \le \bar{g}_i$ for $i = 1,2$, this is at most

$$\frac{s_1\tau + s_2\tau + s_1 s_2\tau^2}{\bar{g}} \le \frac{s_1\tau + s_2\tau + \tau}{\bar{g}}$$

since each $s_i \le 3n$, and since $\tau \le 1/9n^2$. This gives the desired bound of $s\tau/\bar{g}$ for Case 1.

Alternatively, note that $\bar{g}_1 \ge \epsilon/5n$ since $g_1$ is subsumed by an uncle of some influential
variable $x_j$ relevant to $g_2$, and by Lemma 7.1. Thus, since $s_1 \le 3n$, $s_1\tau/\bar{g}_1 \le 5/4$ by our choice
of $\tau$. Similarly, $s_2\tau/\bar{g}_2 \le 5/4$, and therefore, the bound given in equation (7.6) implies that
$E[|h - \hat{h}|] \le 65/16 < 5$.

Figure 6: The decomposition of $f$ used in Case 2.
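The constants in Case 1 can be verified with exact rational arithmetic. The sketch below assumes the values used above, $\tau = \epsilon/12n^2$, $s_i \le 3n$ and $\bar{g}_i \ge \epsilon/5n$; the function name `case1_bounds` is hypothetical.

```python
from fractions import Fraction

def case1_bounds(n, eps):
    """Worst-case values of the terms of (7.6) under the Case 1 assumptions."""
    tau = eps / (12 * n * n)           # tau = eps / 12 n^2
    s_max = 3 * n                      # each s_i is at most 3n
    g_min = eps / (5 * n)              # each expected value g_i is at least eps / 5n
    per_input = s_max * tau / g_min    # bound on s_i * tau / g_i
    total = per_input + per_input + per_input * per_input
    cross_ok = s_max * s_max * tau * tau <= tau  # s_1 s_2 tau^2 <= tau
    return per_input, total, cross_ok

per_input, total, cross_ok = case1_bounds(10, Fraction(1, 10))
```

The factors of $n$ and $\epsilon$ cancel in `per_input`, so the values $5/4$ and $65/16$ do not depend on the particular $n$ and $\epsilon$ chosen.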

Case  2 

In this case, A is a $\mathrm{LIN}_{z,w}$ gate immediately fed by some variable $x_i$ so that $g = z + w x_i$.
Also, A is an input to some MUL gate $\Gamma_{ij}$ for some index $j$. Thus, $g_{ij} = g g_1$ for some subformula
$g_1$ to which $x_j$ is relevant. We can decompose $g_1$ in terms of $x_j$, and write $g_1 = u_1 + v_1 x_j$.
Similarly, $f = u_0 + v_0 g_{ij}$ for $f$'s decomposition in terms of $g_{ij}$. (This decomposition of $f$ is
shown in Figure 6.)

Thus, $f = u_0 + v_0 g g_1 = u_0 + v_0(z + w x_i)(u_1 + v_1 x_j)$, so

$$B_j = v_0 v_1 (z + w x_i)$$

$$A_{ij} = v_0 v_1 w.$$

Also, $(B_j \mid x_i \leftarrow 0) = v_0 v_1 z$. Let $\alpha = E[B_j \mid x_i \leftarrow 0]$, $\beta = \bar{A}_{ij}$ and $\gamma = \bar{B}_j$. Then by independence,
$\alpha = \bar{v}_0\bar{v}_1 z$, $\beta = \bar{v}_0\bar{v}_1 w$, and $\gamma = \bar{v}_0\bar{v}_1(z + w p_i)$. (Recall that $p_i$ is the probability that bit $x_i$ is 1.)
Let $\hat\alpha$, $\hat\beta$ and $\hat\gamma$ be respective estimates of these quantities. (The quantity $\alpha$ can be estimated
using the fact that $\alpha = E[(f \mid x_i \leftarrow 0, x_j \leftarrow 1) - (f \mid x_i \leftarrow 0, x_j \leftarrow 0)]$. Thus, $|\alpha - \hat\alpha| \le a\epsilon^4/n^5$.)
Then it can be seen that

$$\frac{\alpha + \beta x_i}{\gamma} = \frac{z + w x_i}{z + w p_i} = \frac{g}{\bar{g}} = h$$

so let $\hat{h} = (\hat\alpha + \hat\beta x_i)/\hat\gamma$ in this case.

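The next fact will be useful for analyzing the error of $\hat{h}$.

Lemma 7.13 For $b$ and $\hat{b}$ nonzero,

$$\left|\frac{a}{b} - \frac{\hat{a}}{\hat{b}}\right| \le \frac{|a - \hat{a}|}{|\hat{b}|} + \left|\frac{a}{b}\right| \cdot \frac{|b - \hat{b}|}{|\hat{b}|}.$$

Proof:

$$
\begin{aligned}
\left|\frac{a}{b} - \frac{\hat{a}}{\hat{b}}\right|
  &= \left|\frac{a}{b} - \frac{a}{\hat{b}} + \frac{a}{\hat{b}} - \frac{\hat{a}}{\hat{b}}\right| \\
  &\le \left|\frac{a - \hat{a}}{\hat{b}}\right| + \left|\frac{a\hat{b} - ab}{b\hat{b}}\right|. \qquad ■
\end{aligned}
$$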

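We are now ready to analyze the error of $\hat{h}$. First, since $x_j$ is influential, $|\hat\gamma| \ge \epsilon/5n$. By
Lemma 7.13,

$$E[|h - \hat{h}|] \le E\left[\frac{|\alpha - \hat\alpha + (\beta - \hat\beta)x_i|}{|\hat\gamma|} + |h| \cdot \frac{|\gamma - \hat\gamma|}{|\hat\gamma|}\right].$$

Clearly, $h$ is nonnegative and $\bar{h} = 1$. Thus,

$$
\begin{aligned}
E[|h - \hat{h}|] &\le \frac{1}{|\hat\gamma|} \cdot \left(|\alpha - \hat\alpha| + |\beta - \hat\beta| + |\gamma - \hat\gamma|\right) \\
  &\le \frac{5n}{\epsilon} \cdot \frac{3a\epsilon^4}{n^5} \\
  &= \frac{15a\epsilon^3}{n^4} \le \tau \le \frac{\tau}{\bar{g}}
\end{aligned}
$$

since $\bar{g} \le 1$. Since $g$ contains one gate, this satisfies our inductive hypothesis in case 2.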
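The Case 2 estimator and the ratio bound of Lemma 7.13 can be illustrated with a few lines of Python. This is a sketch with made-up numbers: the values of $z$, $w$, $p_i$ and of the product $\bar{v}_0\bar{v}_1$ below are assumptions chosen only to exercise the formulas, and the function names are hypothetical.

```python
def ratio_error_bound(a, b, a_hat, b_hat):
    """Right-hand side of Lemma 7.13: a bound on |a/b - a_hat/b_hat|."""
    return abs(a - a_hat) / abs(b_hat) + abs(a / b) * abs(b - b_hat) / abs(b_hat)

def case2_h(alpha, beta, gamma, x_i):
    """(alpha + beta * x_i) / gamma, which equals g divided by its mean
    when alpha, beta and gamma are exact."""
    return (alpha + beta * x_i) / gamma

# Illustrative values: g = z + w * x_i with Pr[x_i = 1] = p_i, and c standing
# for the product of the expected values of v_0 and v_1.
z, w, p_i, c = 0.2, 0.6, 0.5, 0.3
alpha, beta, gamma = c * z, c * w, c * (z + w * p_i)

h_when_one = case2_h(alpha, beta, gamma, 1)   # (z + w) / (z + w * p_i)
h_when_zero = case2_h(alpha, beta, gamma, 0)  # z / (z + w * p_i)
mean_h = p_i * h_when_one + (1 - p_i) * h_when_zero
```

The unknown factor `c` cancels in the ratio, which is why the algorithm never needs to estimate $\bar{v}_0\bar{v}_1$ separately.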


Case  3 

In this case, A is a $\mathrm{LIN}_{z,w}$ gate which has as input some MUL gate $\Gamma_{ij}$ so that $g = z + w g_{ij}$.
Then $g_{ij}$ can be expressed as a product of its inputs $g_{ij} = g_1 g_2$, where $x_i$ and $x_j$ are relevant to
$g_1$ and $g_2$, respectively. Decomposing $g_1$ and $g_2$, we can write $g_1 = u_1 + v_1 x_i$ and $g_2 = u_2 + v_2 x_j$.
Gate A immediately feeds some MUL gate $\Gamma_{ik}$. We can write $g_{ik}$ as a product of its inputs
$g_{ik} = g g_3$ where $g_3$ is some subformula which contains $x_k$. Decomposing $g_3$, we can write
$g_3 = u_3 + v_3 x_k$. Finally, we can decompose $f$ in terms of $g_{ik}$ so that $f = u_0 + v_0 g_{ik}$. (All this is
summarized in Figure 7.)

Figure 7: The decomposition of $f$ used in Case 3.

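Thus, we have that

$$
\begin{aligned}
B_i &= v_0 v_1 w g_2 g_3 \\
B_k &= v_0 (z + w g_1 g_2) v_3 \\
A_{ij} &= v_0 v_1 w v_2 g_3 \\
A_{jk} &= v_0 v_3 w g_1 v_2.
\end{aligned}
$$

Let $\alpha = \bar{B}_i \cdot \bar{A}_{jk}$, and let $\beta = \bar{B}_k \cdot \bar{A}_{ij}$. Then, by independence, and since $\bar{g}_{ij} = \bar{g}_1\bar{g}_2$,

$$\alpha = \bar{v}_0^2 \bar{v}_1 \bar{v}_2 \bar{v}_3 w \bar{g}_3 (w \bar{g}_{ij})$$

$$\beta = \bar{v}_0^2 \bar{v}_1 \bar{v}_2 \bar{v}_3 w \bar{g}_3 (z + w \bar{g}_{ij}).$$

Let $\hat\alpha = \hat{B}_i \cdot \hat{A}_{jk}$ and $\hat\beta = \hat{B}_k \cdot \hat{A}_{ij}$.

We assume inductively that an approximation $\hat{h}_0$ is available for $h_0 = g_{ij}/\bar{g}_{ij}$. We assume
that $E[|h_0 - \hat{h}_0|] \le \min(5, (s-1)\tau/\bar{g}_{ij})$ where $s$ is the number of gates occurring in $g$.

Let $\gamma = \alpha/\beta$, and let $\hat\gamma = \hat\alpha/\hat\beta$. Then $\gamma = w\bar{g}_{ij}/(z + w\bar{g}_{ij}) = w\bar{g}_{ij}/\bar{g}$. Thus,

$$h = g/\bar{g} = 1 - \gamma + \gamma h_0.$$

Naturally, then, we let

$$\hat{h} = 1 - \hat\gamma + \hat\gamma\hat{h}_0$$

in this case. We show that $\hat{h}$ has the desired accuracy.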

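First,

$$|\alpha - \hat\alpha| \le |\bar{B}_i||\bar{A}_{jk} - \hat{A}_{jk}| + |\hat{A}_{jk}||\bar{B}_i - \hat{B}_i| \le 2a\epsilon^4/n^5.$$

Similarly, $|\beta - \hat\beta| \le 2a\epsilon^4/n^5$.

Note that $|\gamma| = |w| \cdot \bar{g}_{ij}/\bar{g} \le 1/\bar{g}$. Also, since all the variables are influential, $|\beta| \ge (\epsilon/5n)^3$
(since $|\bar{A}_{ij}| \ge |\bar{B}_i||\bar{B}_j|$); thus,

$$|\hat\beta| \ge (\epsilon/5n)^3 - 2a\epsilon^4/n^5 \ge (5^{-3} - 2a)(\epsilon/n)^3.$$

So, by Lemma 7.13, we have

$$
\begin{aligned}
|\gamma - \hat\gamma| &\le \frac{1}{|\hat\beta|}\left(|\alpha - \hat\alpha| + |\gamma||\beta - \hat\beta|\right) \\
  &\le \frac{1}{5^{-3} - 2a}\left(\frac{n}{\epsilon}\right)^3\left(|\alpha - \hat\alpha| + \frac{|\beta - \hat\beta|}{\bar{g}}\right) \\
  &\le \frac{1}{\bar{g}}\left(\frac{4a}{5^{-3} - 2a}\right)\frac{\epsilon}{n^2} \\
  &\le \frac{\tau}{7\bar{g}}.
\end{aligned}
$$

So the error of $\hat{h}$ can be computed as follows:

$$
\begin{aligned}
E[|h - \hat{h}|] &= E[|\hat\gamma - \gamma + \gamma h_0 - \hat\gamma\hat{h}_0|] \\
  &\le |\gamma - \hat\gamma| + |\gamma - \hat\gamma| \cdot E[|h_0|] + |\gamma| \cdot E[|h_0 - \hat{h}_0|] + |\gamma - \hat\gamma| \cdot E[|h_0 - \hat{h}_0|].
\end{aligned}
$$

Since $E[|h_0|] = 1$, and by our inductive assumption on the error of $\hat{h}_0$, this is at most

$$
\begin{aligned}
7|\gamma - \hat\gamma| + |\gamma| \cdot E[|h_0 - \hat{h}_0|] &\le 7|\gamma - \hat\gamma| + \frac{w\bar{g}_{ij}}{\bar{g}} \cdot \frac{(s-1)\tau}{\bar{g}_{ij}} \\
  &\le \frac{\tau + (s-1)\tau}{\bar{g}} = s\tau/\bar{g}
\end{aligned}
$$

as desired.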


Figure  8:  The  decomposition  of  /  used  in  Case  5. 


Case  4 

In this very easy case, $f = \mathrm{LIN}_{z,w}(x_1)$ for some $z, w$. Since $f = z + w x_1$, $B_1 = \bar{B}_1 = w$. Let
$\hat{w} = \hat{B}_1$, and let $\hat{z}$ be an estimate of $z = E[f \mid x_1 \leftarrow 0]$. (Recall that an estimate of this latter
quantity was made in calculating $\hat{B}_1$. Thus, $|z - \hat{z}| \le a\epsilon^4/n^5$.) Finally, let $\hat{f} = \hat{z} + \hat{w} x_1$. Then

$$E[|f - \hat{f}|] \le |z - \hat{z}| + |w - \hat{w}| \le 2a\epsilon^4/n^5 \le \epsilon/4$$

as desired.


Case  5 

In this case, A is a $\mathrm{LIN}_{z,w}$ gate computing the formula's output $f$. Gate A receives its input
from some MUL gate $\Gamma_{ij}$. Thus, $f = z + w g_{ij}$, and $g_{ij}$ computes the product $g_{ij} = g_1 g_2$, where
variables $x_i$ and $x_j$ are relevant to $g_1$ and $g_2$, respectively. (Refer to Figure 8.)

We assume inductively that an approximation $\hat{h}_0$ has been computed for $h_0 = g_{ij}/\bar{g}_{ij}$,
and that the accuracy of $\hat{h}_0$ is such that $E[|h_0 - \hat{h}_0|] \le \min(5, (s-1)\tau/\bar{g}_{ij})$ where $s$ is the
number of gates occurring in $f$. We wish to use $\hat{h}_0$ to compute an approximation $\hat{f}$ so that
$E[|f - \hat{f}|] \le \epsilon/4$.

We can decompose each subformula $g_t$ so that $g_1 = u_1 + v_1 x_i$ and $g_2 = u_2 + v_2 x_j$. Then we
have:

$$
\begin{aligned}
B_i &= w v_1 g_2 \\
B_j &= w v_2 g_1 \\
A_{ij} &= w v_1 v_2.
\end{aligned}
$$

Let $\alpha = \bar{f}$, and let $\beta = \bar{B}_i \cdot \bar{B}_j/\bar{A}_{ij}$. Let $\hat\beta = \hat{B}_i \cdot \hat{B}_j/\hat{A}_{ij}$, and let $\hat\alpha$ be an estimate of $\bar{f}$ so that
$|\alpha - \hat\alpha| \le \epsilon/24n^2$ with probability at least $1 - \delta/4$. Such an estimate is easily computed since $\bar{f}$
is just the probability that a positive example is received from the examples oracle. (Actually,
we can only sample according to the "true" target formula, rather than the one derived in
Stage II, and here regarded as the target. The additional error introduced by this fact has been
folded into the stated accuracy, as was done for our estimates $\hat{B}_i$, etc.)

Note that $\alpha = z + w\bar{g}_{ij}$, and $\beta = w\bar{g}_1\bar{g}_2 = w\bar{g}_{ij}$. Thus, it is easily seen that

$$f = (\alpha - \beta) + \beta h_0,$$

and we let

$$\hat{f} = (\hat\alpha - \hat\beta) + \hat\beta\hat{h}_0.$$

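We show that $\hat{f}$ has the desired accuracy.

First, note that

$$|\bar{B}_i\bar{B}_j - \hat{B}_i\hat{B}_j| \le |\bar{B}_i||\bar{B}_j - \hat{B}_j| + |\hat{B}_j||\bar{B}_i - \hat{B}_i| \le 2a\epsilon^4/n^5.$$

Clearly, $|\beta| \le 1$, so $|\bar{A}_{ij}| \ge |\bar{B}_i||\bar{B}_j| \ge (\epsilon/5n)^2$. Thus, $|\hat{A}_{ij}| \ge (\epsilon/5n)^2 - a\epsilon^4/n^5 \ge (5^{-2} - a)(\epsilon/n)^2$.
Thus, by Lemma 7.13,

$$
\begin{aligned}
|\beta - \hat\beta| &\le \frac{1}{|\hat{A}_{ij}|} \cdot \left(|\bar{B}_i\bar{B}_j - \hat{B}_i\hat{B}_j| + |\beta||\bar{A}_{ij} - \hat{A}_{ij}|\right) \\
  &\le \frac{3a}{5^{-2} - a} \cdot \frac{\epsilon^2}{n^3}.
\end{aligned}
$$

Since $|\beta| \le 1$, we assume without loss of generality that $|\hat\beta| \le 1$. Thus, the average error of $\hat{f}$
can then be computed as follows:

$$
\begin{aligned}
E[|f - \hat{f}|] &= E[|\alpha - \hat\alpha + \hat\beta - \beta + \beta h_0 - \hat\beta\hat{h}_0|] \\
  &\le E[|\alpha - \hat\alpha| + |\beta - \hat\beta| + |\beta - \hat\beta||h_0| + |\beta||h_0 - \hat{h}_0| + |\beta - \hat\beta||h_0 - \hat{h}_0|] \\
  &\le |\alpha - \hat\alpha| + 7|\beta - \hat\beta| + |\beta| \cdot E[|h_0 - \hat{h}_0|] \\
  &\le \frac{\epsilon}{24n^2} + \frac{21a\epsilon^2}{(5^{-2} - a)n^3} + (s-1)\tau \\
  &\le s\tau \le 3n\tau \le \epsilon/4
\end{aligned}
$$

as desired. Here we have used the fact that $E[|h_0|] = 1$, and our inductive bound on the error
of $\hat{h}_0$.

Combined with the previous cases, this completes the induction, and proves that the final
computed hypothesis $\hat{f}$ has error at most $E[|f - \hat{f}|] \le \epsilon/4$.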
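The output-gate construction of Case 5 can be sketched as follows. The example is illustrative only: it plugs in exact values of $\alpha$, $\beta$ and $h_0$ for a made-up two-variable formula to show that $(\alpha - \beta) + \beta h_0$ recovers $f$; the function name `case5_hypothesis` and all numeric values are assumptions.

```python
def case5_hypothesis(alpha_hat, beta_hat, h0_hat):
    """Build f_hat(x) = (alpha_hat - beta_hat) + beta_hat * h0_hat(x)."""
    def f_hat(x):
        return (alpha_hat - beta_hat) + beta_hat * h0_hat(x)
    return f_hat

# Made-up target: f = z + w * g_ij with g_ij = x_0 * x_1 and independent
# uniform bits, so the expected value of g_ij is 1/4.
z, w, g_mean = 0.1, 0.5, 0.25
alpha = z + w * g_mean   # expected value of f
beta = w * g_mean        # equals B_i * B_j / A_ij in expectation

def h0(x):
    """g_ij divided by its mean."""
    return (x[0] * x[1]) / g_mean

f_hat = case5_hypothesis(alpha, beta, h0)
```

With exact inputs the reconstruction is exact; the analysis above bounds how far it drifts when $\hat\alpha$, $\hat\beta$ and $\hat{h}_0$ are only estimates.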

3-7.6  Putting  it  all  together 

Thus, we showed in Stage I how to eliminate sticky and uninfluential variables from the target
formula $f$, yielding another formula $f_I$ with $E[|f - f_I|] \le 2\epsilon/3$.

In Stage II, we regarded $f_I$ as the target, and showed how to find an approximate skeleton
$\sigma$ of $f_I$; in particular, we showed that there exist $z, w$ values for $\sigma$ which result in a formula
$f_{II}$. This formula is very close to $f_I$; we showed that $|f_I - f_{II}|$ is much smaller than $\epsilon/12$ on all
inputs, and so certainly $E[|f_I - f_{II}|] \le \epsilon/12$.

Finally, in Stage III, we regarded $f_{II}$ as the target, causing a slight degradation in the
accuracy of our estimates. Nevertheless, we described a technique for approximating the $z, w$
values for $f_{II}$, giving a final formula $\hat{f}$ with $E[|\hat{f} - f_{II}|] \le \epsilon/4$. Since $2\epsilon/3 + \epsilon/12 + \epsilon/4 = \epsilon$, it follows immediately that
$E[|f - \hat{f}|] \le \epsilon$ as desired. In other words, $\hat{f}$ is an $\epsilon$-good model of probability for $f$.

All of the operations described are clearly polynomial time, and the sample size is also
polynomial. The sample was needed to estimate the probability that each bit is 1, and the
expected value of $f$ when zero, one or two unsticky variables are hardwired to fixed values. For
the accuracy needed, we can use Chernoff bounds (Lemma 2-3.6) to show that a sample of size
$O((n^{22}/\epsilon^{18}) \cdot \log(n/\delta))$ suffices.
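To give a feel for how such sample sizes arise, the sketch below computes the number of independent samples that suffice to estimate the means of several $[0,1]$-valued quantities simultaneously. It uses the standard two-sided Hoeffding inequality with a union bound as a stand-in for Lemma 2-3.6, whose exact form is not reproduced here, and it does not account for the extra samples needed to simulate the filtered distributions, so it is not expected to reproduce the $n^{22}/\epsilon^{18}$ figure. The function name is hypothetical.

```python
import math

def hoeffding_sample_size(accuracy, delta, num_estimates=1):
    """Samples sufficient so that, with probability at least 1 - delta, each of
    num_estimates empirical means of [0, 1]-valued variables is within
    `accuracy` of its expectation (Hoeffding bound plus a union bound)."""
    return math.ceil(math.log(2 * num_estimates / delta) / (2 * accuracy ** 2))

m_single = hoeffding_sample_size(0.1, 0.05)
m_many = hoeffding_sample_size(0.1, 0.05, num_estimates=100)
```

The required sample grows with the inverse square of the accuracy but only logarithmically with the number of estimates and with $1/\delta$, which is why the accuracy requirement dominates the bound.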

After  the  sample  is  drawn  our  algorithm  records  the  obtained  estimates  of  these  values.  All 
of  the  remaining  operations  of  the  algorithm  take  negligible  time  compared  to  the  time  needed 
to  compute  and  record  these  values  from  the  (unfortunately  quite  large)  sample. 

Thus  we  have: 


Theorem 7.14 There exists an algorithm with the following properties: Given $\epsilon, \delta > 0$ and
access to random examples chosen according to a product distribution and classified randomly
according to some read-once real formula $f$, the algorithm outputs an $\epsilon$-good model of probability
for $f$ with probability at least $1 - \delta$. The sample size needed by the algorithm is $O((n^{22}/\epsilon^{18}) \cdot
\log(n/\delta))$, and its running time is $O((n^{24}/\epsilon^{18}) \cdot \log(n/\delta))$.

From the comments made at the beginning of this section, we have as corollary:

Corollary 7.15 There exists a polynomial-time algorithm that PAC-learns the class of
read-once Boolean formulas against any product distribution.


3-8  Conclusion  and  open  problems 

In this chapter, we have described polynomial-time algorithms for learning various classes of
read-once formulas in a number of settings. Our algorithms are based on a simple statistical
method of observing the formula's behavior under various perturbations of the target
distribution.

The main open question is to determine how far this apparently powerful method can be
extended. In particular, can it be applied to formulas which are not read-once? Can it be
extended beyond product distributions? Are there entirely different classes to which it
might be extended, such as decision trees, or finite automata?

Considering  the  classes  described  in  this  chapter,  can  the  algorithms  described  be  improved? 
(It  seems  that  this  should  certainly  be  the  case  for  the  algorithm  of  Section  3-7.)  Turning  the 
question  around,  can  we  find  good,  non-trivial  lower  bounds  for  these  problems?  It  is  unclear 
what  such  a  lower-bound  proof  would  look  like,  especially  since,  in  the  PAC  model,  much  smaller 
sample  sizes  are  known  to  suffice  in  a  computationally  unbounded  setting.  (This  follows,  for 
instance,  from  Occam’s  Razor  of  Blumer  et  al.  [13].) 

Finally,  can  our  algorithms  be  extended  to  the  so-called  two-oracle  model?  In  the  one- 
oracle  model  (considered  exclusively  in  this  chapter),  the  learner  receives  both  positive  and 
negative  examples  from  a  random  source  of  examples,  and  must  perform  well  as  measured 
against  this  single  distribution.  In  the  two-oracle  model,  the  learner  has  two  random  sources 
of  examples,  one  that  provides  just  positive  examples,  and  the  other  providing  just  negative 
examples.  The  learner  must  perform  well  against  both  distributions  (i.e.,  both  the  distribution 
on  the  positive,  and  the  distribution  on  the  negative  examples).  In  the  distribution-free  model, 
Haussler  et  al.  [38]  show  that  these  two  models  are  equivalent.  However,  their  proof  falls 
apart when the form of the target distribution is restricted. Kearns et al. [51] and Pagallo
and Haussler [66] have shown that read-once formulas in disjunctive normal form are efficiently
learnable against the uniform distribution in the two-oracle model. Can the results in this
chapter be similarly extended?


Chapter  4 


Efficient  Distribution-free  Learning 
of  Probabilistic  Concepts 


4-1  Introduction 

Consider  the  following  scenarios: 

A meteorologist is attempting to predict tomorrow's weather as accurately as possible. He
measures a small number of presumably relevant parameters, such as the current
temperature, barometric pressure, and wind speed and direction. He then makes a forecast of
the form "chances for rain tomorrow are 70%." The next day it either rains or it does
not rain.

A  statistician  wishes  to  compile  an  approximate  rule  for  predicting  when  students  will  be 
admitted  to  a  particular  college.  There  are  some  students  whose  record  is  so  strong  they 
will  be  accepted  regardless  of  which  admissions  officer  reviews  their  file;  similarly,  there  are 
others  who  are  categorically  rejected.  For  many  students,  however,  their  admission  may 
be  highly  dependent  on  the  particular  admissions  officer  that  evaluates  their  application; 
thus  the  best  model  for  the  chances  of  these  borderline  students  involves  a  probability  of 
acceptance.  However,  every  student  is  either  accepted  or  rejected. 

A  physicist  is  attempting  to  determine  the  orientation  of  spin  for  particles  in  a  certain  magnetic 
field.  Presumably,  the  orientation  of  spin  is  at  least  partially  determined  by  a  genuinely 
random  process  of  Nature.  The  spin  of  any  particle  is  always  oriented  either  up  or  down. 

We wish to produce a good model for the recognition of common objects such as chairs. For
most objects in the world, there is nearly universal agreement as to whether that object
is a chair or a non-chair. There do exist, however, a few objects that provoke widespread
disagreement, such as stools and benches. This is due to the fact that the concept of
"chair" is not absolute, and philosophical boundaries of this concept may be exposed
by both naturally occurring and artificially constructed objects. Most young children,
however, are not explicitly told about such definitional shortcomings; they are simply
told whether or not something is a chair.

There  are  some  obvious  common  themes  in  each  of  the  above  situations.  First,  in  each 
there  is  uncertain  or  probabilistic  behavior.  This  uncertainty  may  arise  for  radically  different 
philosophical  reasons.  For  example,  in  the  case  of  the  meteorologist,  it  could  be  that  while  the 
weather  is  in  principle  a  deterministic  process,  the  parameters  measured  by  the  meteorologist 
and  the  limited  accuracy  of  these  measurements  are  insufficient  to  determine  this  process. 
In  the  case  of  the  physicist,  the  electron  spin  is  believed  to  be  governed  to  some  degree  by  a 
truly  random  process.  In  the  case  of  the  statistician,  the  uncertainty  arises  from  the  diversity  of 
human  behavior,  and  in  the  case  of  chair  recognition,  a  probability  may  model  the  philosophical 
difficulties  of  providing  a  deterministic  definition  to  an  inherently  uncertain  or  “fuzzy”  concept. 

A  second  theme  that  is  common  to  each  of  these  settings  is  the  fact  that  even  though 
the  best  model  may  be  a  conditional  probability  p(x)  that  the  event  (rain,  acceptance  to  the 
college,  etc.)  occurs  given  x  (where  x  represents  the  measured  weather  variables  or  a  student’s 
application),  the  observer  only  witnesses  whether  or  not  the  event  occurs.  Thus,  examples  are 
of the form $(x,0)$ or $(x,1)$, not $(x,p(x))$, and the $\{0,1\}$ label provided with $x$ is distributed
according to the conditional probability $p(x)$. Furthermore, we should not expect to be able to
compute even an estimate of $p(x)$ from the given $\{0,1\}$-labeled examples, since in general we
are unlikely to ever see the same $x$ twice (each day's weather is at least slightly different, as is
each student's application).

Finally,  although  there  is  uncertainty  in  each  of  these  settings,  there  is  also  some  structure  to 
this  uncertainty.  For  instance,  days  with  nearly  identical  atmospheric  conditions  and  students 
with  very  similar  high  school  records  can  be  expected  to  have  nearly  equal  probabilities  of 
rain  and  acceptance  to  the  college,  respectively.  We  also  expect  some  inputs  x  to  be  assigned 
conditional  probabilities  that  are  very  near  0  or  1;  for  example,  days  on  which  the  sky  is  cloudless 
or  students  with  straight  A’s.  This  structured  behavior  strongly  distinguishes  these  learning 
scenarios  from  a  “noisy”  setting,  such  as  that  considered  by  Angluin  and  Laird  [9],  Kearns 
and  Li  [50],  and  Sloan  [81].  In  a  model  of  learning  with  noise,  the  noise  is  typically  “white” 
(that is, all inputs have either an equal probability of corruption or a probability determined
by  an  adversary),  and  the  noise  is  regarded  as  something  an  algorithm  wishes  to  “filter  out” 
in  an  attempt  to  uncover  some  underlying  deterministic  concept.  In  the  examples  given  above, 
the probabilistic behavior is both structured (possibly in a manner that can be exploited by a
learning algorithm) and inherently part of the underlying phenomenon. Thus, whenever possible
we  do  not  wish  to  filter  this  probabilistic  behavior  out  of  the  hypothesis,  but  rather  to  model 
it. 

In  this  chapter  we  wish  to  study  a  model  of  learning  in  such  uncertain  environments.  We 
formalize  these  settings  by  introducing  the  notion  of  a  probabilistic  concept  (or  p-concept).  A 
p-concept c over a domain set X is simply a mapping c : X → [0,1]. For each x ∈ X, we
interpret c(x) as the probability that x is a positive example of the p-concept c. Following
the discussion above, a learning algorithm in this framework is attempting to infer something
about the underlying target p-concept c solely on the basis of labeled examples (x,b), where
b ∈ {0,1} is a bit generated randomly according to the conditional probability c(x), i.e., b = 1
with probability c(x).

The  value  c(x)  may  be  viewed  as  a  measure  of  the  degree  to  which  x  exemplifies  some 
concept c. In this sense, p-concepts are quite similar to the related notion of a fuzzy set, a kind
of  “set”  whose  boundaries  are  fuzzy  or  unclear,  and  whose  formal  definition  is  nearly  identical 
to  that  of  a  p-concept.  An  axiomatic  theory  of  fuzzy  sets  was  introduced  by  Zadeh  [89],  and 
they  have  since  received  much  treatment  by  researchers  in  the  field  of  pattern  recognition.  See 
Kandel’s  book  [47]  for  a  good  introduction. 

We  distinguish  two  possible  goals  for  a  learning  algorithm  in  the  p-concept  model.  The  first 
and  easier  goal  is  that  of  label  prediction:  the  algorithm  wishes  to  output  a  hypothesis  that 
maximizes  the  probability  of  correctly  predicting  the  {0,1}  label  generated  by  c  on  an  input 
x.  We  call  this  kind  of  learning  decision-rule  learning,  since  we  are  not  primarily  concerned 
with  actually  modeling  the  underlying  uncertainty  but  instead  wish  to  accurately  predict  the 
observable  {0, 1}  outcome  of  this  uncertainty.  The  more  difficult  and  more  interesting  goal  is 
that  of  finding  a  good  model  of  probability.  Here  the  algorithm  wishes  to  output  a  hypothesis 
p-concept h : X → [0,1] that is a good real-valued approximation to the target c; thus, we
want |c(x) − h(x)| to be small for most inputs x. Following the motivation given above, we are
mainly  concerned  with  this  latter  notion  of  learning. 

To  model  the  aforementioned  structure  of  the  target  p-concept,  we  study  the  learnability  of 
classes  of  p-concepts  that  obey  natural  mathematical  properties  intended  to  model  some  realistic 
environments.  As  a  simple  example,  in  constructing  a  p-concept  model  of  the  subjective  notion 
of “tall,” it is reasonable to assume that x ≥ y implies c(x) ≥ c(y) (where x represents height):
the taller a person actually is, the higher the percentage of people who will agree he is tall (or
the  greater  the  “degree  of  tallness”  we  wish  to  assign).  This  motivates  us  to  consider  learning 
the  class  C  of  all  non-decreasing  p-concepts  over  the  positive  real  line.  In  general,  we  wish 
to  study  the  learnability  of  p-concept  classes  that  are  restricted  in  such  a  way  as  to  plausibly 
capture some realistic situation, but are not so restricted as to make the learning problem trivial
or uninteresting.

We  adopt  from  the  Valiant  model  for  learning  deterministic  concepts  [83]  the  emphasis 
on  learning  algorithms  that  are  both  efficient  (in  the  sense  of  polynomial  time)  and  general 
(in  the  sense  of  working  for  the  largest  possible  p-concept  classes  and  against  any  probability 
distribution  over  the  domain).  After  formalizing  the  learning  model  and  the  two  possible  goals 
for  a  learning  algorithm  (decision-rule  learning  and  model-of-probability  learning),  we  embark 
on  a  systematic  study  of  techniques  for  designing  efficient  algorithms  for  learning  p-concepts 
and  the  underlying  theory  of  the  p-concept  model. 

We begin by giving examples of efficient algorithms producing a good model of probability
that employ what we call the direct approach; the analyses of these algorithms give
first-principles arguments that the output hypothesis is good. These include algorithms for arbitrary
non-decreasing  functions  motivated  above,  a  probabilistic  analog  of  Rivest’s  decision  lists  [72], 
and  a  class  of  “hidden-variable”  p-concepts,  motivated  by  settings  such  as  weather  prediction 
where  the  apparently  probabilistic  behavior  may  in  part  be  due  to  the  fact  that  some  relevant 
quantities  remain  undiscovered. 

We  then  consider  the  problem  of  hypothesis  testing  in  the  p-concept  model.  Working  within 
the  framework  suggested  by  Haussler  [36],  we  define  a  loss  function  that  assigns  a  measure 
of goodness to any hypothesis p-concept on a {0,1}-labeled sample. After proving that the
quadratic  loss  measure  is  most  appropriate  for  our  setting,  we  then  give  an  example  of  an 
efficient  algorithm  for  finding  a  model  of  probability  that  first  does  some  direct  computation 
in  order  to  narrow  the  search  and  then  uses  quadratic  loss  to  choose  the  best  hypothesis  from 
among  a  small  remaining  pool.  This  algorithm  learns  a  class  of  p-concepts  in  which  only  a 
small  number  of  variables  are  relevant,  but  the  dependence  on  these  variables  may  be  arbitrary. 

Next  we  consider  the  related  but  more  difficult  issue  of  uniform  convergence  of  a  p-concept 
class. More precisely, how many {0,1}-labeled examples must be taken before we have high
confidence  that  every  p-concept  in  the  class  has  an  empirical  quadratic  loss  that  accurately 
reflects  its  true  performance  as  a  model  of  probability?  In  a  more  general  formulation,  this 
question  has  received  extensive  consideration  in  the  statistical  pattern  recognition  literature, 
and  its  importance  to  learning  has  been  demonstrated  by  many  recent  papers.  We  show  that  the 
sufficient  sample  size  for  uniform  convergence  is  bounded  above  by  the  quadratic  loss  dimension 
of  the  p-concept  class,  a  combinatorial  measure  derived  from  the  combinatorial  dimension 
discussed  by  Haussler  [36]  and  other  authors. 

We  then  give  efficient  algorithms  that  apply  the  uniform  convergence  method  (that  is,  take 
a  large  enough  sample  as  dictated  by  the  quadratic  loss  dimension,  and  find  the  hypothesis 
minimizing  the  empirical  loss  over  the  sample)  in  order  to  find  a  good  model  of  probability. 
In particular, we prove the effectiveness of an algorithm for learning p-concepts represented by
linear combinations of d given basis functions. We then show that the quadratic loss dimension,
when  finite,  is  also  a  lower  bound  on  the  required  sample  size  for  learning  any  p-concept  class 
with  a  model  of  probability;  thus  the  quadratic  loss  dimension,  when  finite,  characterizes  the 
sample  complexity  of  p-concept  learning  with  a  model  of  probability  in  the  same  way  that  the 
Vapnik-Chervonenkis  (VC)  dimension  characterizes  sample  complexity  in  Valiant’s  model.  (See 
Blumer  et  al.’s  paper  [14]  for  a  full  discussion  of  the  VC-dimension.)  However,  we  show  that 
p-concept  classes  of  infinite  quadratic  loss  dimension  may  sometimes  be  learned  efficiently,  in 
contrast  to  classes  of  infinite  VC-dimension  in  the  Valiant  model,  which  are  not  learnable  in 
any  amount  of  time.  (Technically,  this  is  not  always  true  if  “dynamic”  sampling  is  allowed;  see 
Linial,  Mansour  and  Rivest’s  paper  [58]  for  further  details.) 

We conclude with an investigation of Occam’s Razor in the p-concept model. In the Valiant
model,  Blumer  et  al.  [13]  show  that  it  suffices  for  learning  to  find  a  consistent  hypothesis  that 
is  slightly  shorter  than  the  sample  data.  We  look  for  analogies  in  our  setting:  namely,  when 
does  “data  compression”  imply  a  good  model  of  probability?  We  formalize  this  question,  and 
argue  briefly  that  several  of  our  algorithms  can  be  interpreted  as  implementing  a  form  of  data 
compression. 

The  primary  contribution  of  this  research  is  that  of  providing  positive  results  for  efficient 
learnability  in  a  natural  and  important  extension  to  Valiant’s  model.  This  is  significant  because 
the  Valiant  model  has  been  criticized  for  both  its  strong  hardness  results  (and  drought  of 
powerful  positive  results),  and  for  the  unrealistic  deterministic  and  noise-free  view  it  takes 
of  the  concepts  to  be  learned.  While  at  first  it  may  seem  paradoxical  that  we  are  able  to 
simultaneously  generalize  the  model  and  obtain  many  positive  results,  this  intuitively  may  be 
explained  by  the  fact  that  since  we  generalize  the  form  of  the  representations  being  learned, 
there  are  more  ways  in  which  concepts  that  capture  some  natural  and  realistic  setting  may 
be  simply  expressed.  In  contrast,  since  the  Valiant  model  tends  to  emphasize  concept  classes 
based  on  standard  circuit  complexity,  one  is  quickly  led  to  study  very  powerful  and  apparently 
difficult  classes  such  as  disjunctive  normal  form  Boolean  expressions. 

Another  contribution  of  this  research  is  in  demonstrating  the  feasibility  and  practicality  of 
the  approach  suggested  by  Haussler  [36].  His  work  addressed  the  issue  of  sample  complexity 
upper  bounds  in  great  generality,  even  encompassing  the  case  where  the  input-output  relation 
to be learned has no prescribed functional form. This generality prevents Haussler from
obtaining either good sample size lower bounds or efficient learning algorithms; indeed, he cites
both  of  these  as  important  areas  for  further  research.  Our  results  may  be  regarded  as  a  first 
demonstration  of  applying  some  of  Haussler’s  general  principles  to  a  specific  and  realistic  model 
in which computation time is of foremost significance.

4-2 The learning model

Let X be a set called the domain (or instance space). A probabilistic concept (or p-concept) is a
real-valued function c : X → [0,1]. When learning the p-concept c, the value c(x) is interpreted
as the probability that x exemplifies the concept being learned (i.e., the probability that x is a
positive example). A p-concept class C is a family of p-concepts. On any execution, a learning
algorithm for C is attempting to learn a distinguished target p-concept c ∈ C with respect to a
fixed  but  unknown  and  arbitrary  target  distribution  D  over  X.  We  think  of  D  as  modeling  the 
natural  distribution  of  objects  in  the  domain,  and  c  represents  the  probabilistic  concept  to  be 
learned  in  this  domain. 

(More formally, D is a probability measure on a σ-algebra of measurable subsets of X. We
assume implicitly that all of the p-concepts considered are measurable functions with respect
to this σ-algebra on X, and the Borel σ-algebra on [0,1].)

The learning algorithm is given access to an oracle EX that behaves as follows: EX first
draws a point x ∈ X randomly according to the distribution D. Then with probability c(x),
EX returns the labeled example (x,1) and with probability 1 − c(x) it returns (x,0). Thus,
learning  algorithms  never  have  direct  access  to  the  conditional  probabilities  c(x),  but  only  to 
random  examples  whose  labels  are  distributed  according  to  these  unknown  probabilities. 
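
The behavior of EX is easy to simulate when both c and D are available to the simulator. The following Python sketch is illustrative only: the names `make_ex` and `draw_x`, and the particular "tallness" target, are assumptions introduced here and are not part of the model.

```python
import random

def make_ex(c, draw_x, rng=None):
    """Build a zero-argument oracle EX for the p-concept c.

    c      -- maps an instance x to a probability in [0, 1]
    draw_x -- takes a random.Random and returns x drawn according to D
    """
    rng = rng or random.Random()

    def ex():
        x = draw_x(rng)                     # draw x according to D
        b = 1 if rng.random() < c(x) else 0  # label is 1 with probability c(x)
        return x, b

    return ex

# Hypothetical nondecreasing target: "tallness" as a function of height in cm.
def tall(x):
    return min(1.0, max(0.0, (x - 150.0) / 50.0))

ex = make_ex(tall, lambda rng: rng.uniform(140.0, 210.0), random.Random(0))
sample = [ex() for _ in range(5)]  # five labeled examples (x, b)
```

Note that the learner only ever sees the pairs in `sample`; the values `tall(x)` themselves are never revealed.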

Let h be a function mapping X into {0,1}; we call such a function a decision rule. We
define the predictive error of h on c with respect to D, denoted R_D(c,h), as the probability
that h will misclassify a randomly drawn point from EX. If h minimizes R_D(c, ·), then we say
that h is a best decision rule, or a Bayes optimal decision rule, for c. We say that h is an ε-good
decision rule for c if R_D(c,h) ≤ R_D(c,ĥ) + ε, where ĥ is a best decision rule. Thus we ask that
h be nearly as good as the best decision rule for c.

The projection of the p-concept c is the function π_c : X → {0,1} that is 1 if c(x) ≥ 1/2 and
0 if c(x) < 1/2. It is well known and easy to show that for any target p-concept c, its projection
π_c is a Bayes optimal decision rule.
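
On a small finite domain the optimality of the projection can be checked by brute force. The sketch below is only an illustration under assumptions made here: the three-point domain, its weights, and the function names are hypothetical.

```python
from itertools import product

def predictive_error(c, h, D):
    """R_D(c, h): chance that rule h mispredicts the label of a random example.

    c maps each instance to a probability, h maps it to 0 or 1,
    and D maps it to its weight under the target distribution.
    """
    return sum(w * ((1.0 - c[x]) if h[x] == 1 else c[x]) for x, w in D.items())

def projection(c):
    """pi_c: predict 1 exactly when c(x) >= 1/2."""
    return {x: 1 if p >= 0.5 else 0 for x, p in c.items()}

# A hypothetical three-point domain with its distribution and target p-concept.
D = {"u": 0.5, "v": 0.3, "w": 0.2}
c = {"u": 0.9, "v": 0.4, "w": 0.5}

# Enumerate every decision rule on the domain and compare with the projection.
points = list(D)
all_rules = [dict(zip(points, bits)) for bits in product((0, 1), repeat=len(points))]
best = min(predictive_error(c, h, D) for h in all_rules)
bayes = predictive_error(c, projection(c), D)
```

Here `best` and `bayes` coincide: no decision rule on this domain has smaller predictive error than the projection.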

In  this  chapter  we  are  primarily  interested  not  in  the  problem  of  finding  a  good  decision 
rule,  but  in  that  of  producing  an  accurate  real-valued  approximation  to  the  target  p-concept 
itself.  Thus,  we  wish  to  infer  a  good  model  of  probability  with  respect  to  the  target  distribution. 
We say that a p-concept h is an (ε,γ)-good model of probability of c with respect to D if we
have Pr_{x∈D}[|h(x) − c(x)| > γ] ≤ ε. Thus, the value of h must be near that of c on most points
x.

We are now ready to describe our model for learning p-concepts. Let C be a p-concept class
over domain X. We say that C is learnable with a model of probability (respectively, learnable
with a decision rule) if there is an algorithm A such that for any target p-concept c ∈ C, for
any target distribution D over X, for any inputs ε > 0, δ > 0 (and γ > 0 for learning with
a model of probability), algorithm A, given access to EX, halts and with probability at least
1 − δ outputs a p-concept h that is an (ε,γ)-good model of probability (respectively, an ε-good
decision rule) for c with respect to D. Note that this model of learning p-concepts generalizes
Valiant’s model for learning deterministic concepts.

We say that C is polynomially learnable if A runs in time polynomial in 1/ε, 1/δ and, where
appropriate, 1/γ. Often the p-concept class C will be parameterized by a complexity parameter
n, that is, C = ∪_{n≥1} C_n, and p-concepts in C_n share a common subdomain X_n. In such cases
we also allow a polynomial dependence on n.

Our  first  lemma  shows  that  a  good  model  of  probability  can  always  be  efficiently  used 
as  a  good  decision  rule;  thus,  learning  with  a  model  of  probability  is  a  harder  problem  than 
decision-rule learning.

Lemma 2.1 Let C be a class of p-concepts. If C is (polynomially) learnable with a model of
probability, then C is (polynomially) learnable with a decision rule.

Proof: To prove the lemma, we show that the projection of a good model of probability can
be used as a good decision rule. In particular, we show that if h is an (ε,γ)-good model of
probability, then π_h is an (ε + 2γ)-good decision rule. Thus, by choosing ε and γ appropriately,
an arbitrarily good decision rule can be found by the assumed algorithm for learning with a
model of probability.

Let x ∈ X, and suppose |h(x) − c(x)| ≤ γ. If |c(x) − 1/2| > γ, then clearly π_h(x) = π_c(x).
On the other hand, if |c(x) − 1/2| ≤ γ, then it may be that π_h(x) ≠ π_c(x). However, the chance
that π_c(x) agrees with a random label for x (chosen according to c) is at most 1/2 + γ, while
the chance that π_h(x) agrees with the random label is at least 1/2 − γ.

Thus, the difference in predictive error between π_c and π_h (taken over a random choice of
an instance and its label) is at most ε + (1 − ε)·2γ ≤ ε + 2γ. ■

4-2.1  Alternative  formulations 

In  addition  to  the  formulation  given  above,  there  are  various  other  natural  ways  of  expressing 
the  fact  that  some  hypothesis  p-concept  h  is  “close”  to  the  target  c.  For  example,  we  might  say 
that  h  is  a  good  model  of  probability  for  c  if  the  average  difference  between  the  two  functions 
is small, i.e., if the quantity E_{x∈D}[|h(x) − c(x)|] is small. Alternatively, we might ask that the
expected  square  of  the  difference  between  the  functions  be  small. 

As  we  will  see  in  the  following  sections,  these  alternative  definitions  are  sometimes  easier 
to  work  with  than  the  “official”  definition  given  above.  The  next  lemma  shows  that  the  three 
formulations are equivalent modulo polynomial-time computation.

Lemma 2.2 Let h and c be p-concepts, and let D be a target distribution on domain X. Let
ε_1 = E_{x∈D}[|h(x) − c(x)|] and let ε_2 = E_{x∈D}[(h(x) − c(x))^2]. Then

• ε_2 ≤ ε_1 ≤ √ε_2;

• for any γ > 0, h is both an (ε_1/γ, γ)- and an (ε_2/γ^2, γ)-good model of probability for c;

• if h is an (ε,γ)-good model of probability, then ε_1 ≤ ε + γ, and ε_2 ≤ ε + γ^2.

Proof: Since |h(x) − c(x)| ≤ 1 for all x, it is clear that ε_2 ≤ ε_1. Also, by a convexity argument
we have (E[Y])^2 ≤ E[Y^2] for any random variable Y. Thus, ε_1 ≤ √ε_2.

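Let γ > 0. Then by Markov’s inequality,

$$\Pr_{x \in D}\bigl[\,|h(x) - c(x)| > \gamma\,\bigr] \le \epsilon_1/\gamma;$$

similarly,

$$\Pr_{x \in D}\bigl[\,|h(x) - c(x)| > \gamma\,\bigr] = \Pr_{x \in D}\bigl[(h(x) - c(x))^2 > \gamma^2\bigr] \le \epsilon_2/\gamma^2.$$

These imply the second part of the lemma.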

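Finally, suppose h is an (ε,γ)-good model of probability. Then

$$\epsilon_1 \le \Pr_{x \in D}\bigl[\,|h(x) - c(x)| > \gamma\,\bigr] \cdot 1 + \Pr_{x \in D}\bigl[\,|h(x) - c(x)| \le \gamma\,\bigr] \cdot \gamma \le \epsilon + (1 - \epsilon)\gamma \le \epsilon + \gamma.$$

Similarly, ε_2 ≤ ε + γ^2. ■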
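
The three parts of Lemma 2.2 can be checked numerically on a toy instance. The following Python sketch is a sanity check rather than a proof; the four-point uniform domain, the particular h and c, and the function names are assumptions chosen here for illustration.

```python
import math

def l1_l2_distances(h, c, D):
    """Return (eps_1, eps_2): expected absolute and expected squared difference."""
    e1 = sum(w * abs(h[x] - c[x]) for x, w in D.items())
    e2 = sum(w * (h[x] - c[x]) ** 2 for x, w in D.items())
    return e1, e2

def weight_far(h, c, D, gamma):
    """Weight under D of the points where h and c differ by more than gamma."""
    return sum(w for x, w in D.items() if abs(h[x] - c[x]) > gamma)

# Hypothetical uniform distribution on four points.
D = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
c = {0: 0.1, 1: 0.4, 2: 0.6, 3: 0.9}
h = {0: 0.2, 1: 0.4, 2: 0.3, 3: 0.9}
gamma = 0.2

e1, e2 = l1_l2_distances(h, c, D)
eps = weight_far(h, c, D, gamma)  # h is an (eps, gamma)-good model for this eps

part1 = e2 <= e1 <= math.sqrt(e2)
part2 = eps <= e1 / gamma and eps <= e2 / gamma ** 2
part3 = e1 <= eps + gamma and e2 <= eps + gamma ** 2
```

All three flags evaluate to true for this instance, matching the three bullet points of the lemma.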


4-3  Efficient  algorithms:  The  direct  approach 

In  this  section,  we  describe  efficient  algorithms  for  learning  good  models  of  probability  based 
on  first  principles  and  proved  correct  by  direct  arguments.  Later  arguments  will  rely  on  an 
underlying  theory  of  p-concept  learning  that  is  developed  in  subsequent  sections.  We  begin  with 
a p-concept class motivated by the problem of modeling “tallness” discussed in the introduction.


4-3.1  Increasing  functions 

Theorem 3.1 The p-concept class of all nondecreasing functions c : R → [0,1] is polynomially
learnable with a model of probability.


Proof: We prove the result in slightly greater generality for any domain X linearly ordered by
some ordering “≤.” Given positive ε, δ and γ, let t = ⌈4/(εγ)⌉, and let

$$s = \max\{\,\ldots\,\}.$$

Our algorithm begins by drawing a labeled sample of m = st examples (x_i, b_i). The examples
are sorted and reindexed so that x_1 ≤ ··· ≤ x_m. In fact, we assume initially that no instance
occurs twice in the sample so that x_1 < ··· < x_m. Later, we show how this assumption can be
removed.

The set X can naturally be partitioned into t disjoint intervals I_j, each containing exactly s
instances of the sample; specifically, we let I_1 = (−∞, x_s]; I_j = (x_{(j−1)s}, x_{js}] for j = 2, 3, ..., t − 1;
and I_t = (x_{(t−1)s}, ∞). For 1 ≤ j ≤ t, let p̂_j = (1/s) · Σ_{x_i ∈ I_j} b_i. Thus, p̂_j is an estimate of the
probability p_j that a random instance in I_j is labeled 1. Our algorithm outputs a step function
h defined in a natural manner: for x ∈ I_j, we define h(x) = p̂_j.
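
The construction of the step function can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis's own code: the function name is hypothetical, the number of intervals t is passed in directly rather than computed from the accuracy parameters, and the sample is assumed to contain distinct instances and to have a size divisible by t.

```python
import bisect

def learn_step_function(sample, t):
    """Partition the sorted sample into t blocks of equal size s and return
    the step function h whose value on each interval is the fraction of
    1-labels observed in the corresponding block.

    sample -- list of (x, b) pairs with distinct x and b in {0, 1}
    t      -- number of intervals; len(sample) must be divisible by t
    """
    pts = sorted(sample)
    m = len(pts)
    if t <= 0 or m == 0 or m % t != 0:
        raise ValueError("sample size must be a positive multiple of t")
    s = m // t
    # Right endpoints x_s, x_2s, ..., x_(t-1)s of the first t - 1 intervals.
    cuts = [pts[j * s - 1][0] for j in range(1, t)]
    # Empirical label frequency in each block of s consecutive sample points.
    p_hat = [sum(b for _, b in pts[j * s:(j + 1) * s]) / s for j in range(t)]

    def h(x):
        # Intervals are closed on the right, so x belongs to the interval
        # indexed by the number of cut points strictly smaller than x.
        return p_hat[bisect.bisect_left(cuts, x)]

    return h

h = learn_step_function([(1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 1)], t=3)
```

With this toy sample the blocks are {1, 2}, {3, 4} and {5, 6}, so h takes the values 0, 0.5 and 1 on (−∞, 2], (2, 4] and (4, ∞) respectively.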

This algorithm clearly runs in polynomial time. We argue next that the output hypothesis
h is an (ε,γ)-good model of probability (with high probability). Here are the high-level ideas:
first, we show that (with high probability) each interval has weight approximately εγ under the
target distribution. Next we show that if c increases by roughly γ or less on the interval I_j,
then h is close to c on all points in the interval. On the other hand, since c is nondecreasing
and bounded between 0 and 1, c can increase by more than γ in at most 1/γ intervals; since
these “bad” intervals have total weight at most ε, h is a good model of probability.

Specifically, we can apply the uniform convergence results of Vapnik and Chervonenkis [85]
to show that each interval I_j has probability at most εγ/2. Let S be the set of all intervals on
X. Then Theorem 2 of their paper shows that, with probability at least 1 − δ/2, for the sample
size m chosen by our algorithm, the relative fraction of points of the sample occurring in any
interval of S is within εγ/4 of the true weight of the interval under the target distribution. In
particular, since each interval I_j contains 1/t ≤ εγ/4 of the instances in the sample, the weight
of I_j under the target distribution is at most εγ/2.

(Technically, their results rely on certain measurability assumptions which depend on the
choice of X. However, these assumptions are satisfied when X = R.)

Let q_j = c(x_{js}) for 1 ≤ j < t, and let q_0 = 0 and q_t = 1. Then for x ∈ I_j, it is clear that
q_{j−1} ≤ c(x) ≤ q_j since c is nondecreasing. In particular, this is true for each x_i ∈ I_j. Thus,
each point x_i ∈ I_j is labeled 1 with probability c(x_i) ≥ q_{j−1}, and so, for each j, p̂_j ≥ q_{j−1} − γ/2
with probability at least 1 − δ/4t; this follows from the fact that s ≥ (2/γ^2) · ln(4t/δ), and by
applying the additive form of Chernoff bounds given in Lemma 2-3.6. Similarly, p̂_j ≤ q_j + γ/2
with probability at least 1 − δ/4t. Thus, it follows that q_{j−1} − γ/2 ≤ p̂_j ≤ q_j + γ/2 for all j
with probability at least 1 − δ/2. Hence, if q_j − q_{j−1} ≤ γ/2, then |h(x) − c(x)| ≤ γ for x ∈ I_j.
On the other hand, q_j − q_{j−1} can exceed γ/2 for at most 2/γ values of j since c is nondecreasing,
and bounded between 0 and 1. Since each of these “bad” intervals has probability
weight at most εγ/2, the sum total probability of these intervals under D is at most ε. Thus,
h is an (ε,γ)-good model of probability.

Finally, we show how to ensure that the sample does not contain the same instance more
than once. Such a situation could be problematic for our algorithm since it might cause some
of the intervals defined above to be empty, or to contain too many sample points.

The idea is to replace the given domain X and target distribution D with a new domain
X′ and distribution D′ under which the same instance is very unlikely to occur twice. In
particular, we let X′ = X × T and D′ = D × U where U is the uniform distribution on the
set T = {0, ..., 2^k − 1}, and k = ⌈2 lg m + lg(1/δ)⌉. Then X′ is linearly ordered under the
lexicographic ordering (i.e., (x,r) < (y,s) if and only if x < y, or x = y and r < s). Also, the
chance that any pair of instances are the same in a sample of size m drawn according to D′ is
at most C(m,2) · 2^{−k} ≤ m^2 · 2^{−k−1} ≤ δ/2, where C(m,2) = m(m − 1)/2 is the number of pairs.

In addition, given a random source of instances from X drawn according to D, we can easily
simulate the random choice of instances from X′ according to D′: given x ∈ X, we simply draw
a random number r uniformly from T, yielding an instance (x,r) with distribution D′ (x’s label
is not altered). Thus, the previously described algorithm can be simulated (with δ replaced
by δ/2) on domain X′. If the same instance occurs twice in the sample, the algorithm simply
fails; as argued above, this will happen with probability at most δ/2. Thus, with probability
at least 1 − δ, the algorithm returns an (ε,γ)-good hypothesis h (with respect to X′). This
hypothesis can be used to estimate c(x) for a given point x ∈ X by randomly choosing r ∈ T
and evaluating h((x,r)). Although this yields a randomized hypothesis h′, it remains true that
the probability (over choices of x ∈ X and the randomization of h′) that h′ differs by more than
γ from c is at most ε. Thus, h′ is an (ε,γ)-good model of probability if h is. ■
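
The tie-breaking device used at the end of the proof is straightforward to express in code. The Python sketch below is illustrative only; the function names are hypothetical, and Python's built-in tuple comparison is used as the lexicographic ordering on X × T.

```python
import math
import random

def tag_bits(m, delta):
    """k = ceil(2 lg m + lg(1/delta)), the number of random tag bits."""
    return math.ceil(2 * math.log2(m) + math.log2(1 / delta))

def collision_bound(m, k):
    """Union bound C(m, 2) * 2^(-k) on the chance that two tags coincide."""
    return math.comb(m, 2) / 2 ** k

def tag_sample(sample, k, rng):
    """Replace each example (x, b) by ((x, r), b) with r uniform in {0, ..., 2^k - 1}."""
    return [((x, rng.randrange(2 ** k)), b) for x, b in sample]

def randomized_hypothesis(h, k, rng):
    """Turn a hypothesis h on tagged instances into a randomized one on X."""
    return lambda x: h((x, rng.randrange(2 ** k)))

k = tag_bits(1000, 0.1)
bound = collision_bound(1000, k)
tagged = tag_sample([(1.0, 0), (1.0, 1)], k, random.Random(0))
```

For m = 1000 and δ = 0.1 this gives k = 24 tag bits, and `bound` stays below δ/2 as the proof requires.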

This algorithm can be modified to learn with a model of probability any function over the
real line with at most d extremal points: the running time is then polynomial in d, 1/ε, 1/δ and
1/γ.

In principle, the algorithm of Theorem 3.1 could be used to learn the p-concept class of
nondecreasing functions with a decision rule (by applying Lemma 2.1). However, a much simpler
and  more  efficient  algorithm  exists  that  we  give  in  Section  4-5. 

4-3.2  Probabilistic  decision  lists 

We turn next to the problem of learning a probabilistic analog of Rivest’s decision lists [72]. We
define such lists with respect to a basis T_n of Boolean-valued functions on the domain {0,1}^n.
We assume always that T_n contains the constant function 1. Then a probabilistic decision list
c over basis T_n is given by a list (f_1, r_1), ..., (f_s, r_s), where each f_i ∈ T_n, and each r_i ∈ [0,1].
We also assume that f_s is the constant function 1. For any assignment x in the domain, c(x) is
defined to be r_j, where j is the least index for which f_j(x) = 1. In other words, the functions
in T_n are tested one by one in the order specified by the list, until a function which evaluates
to 1 on x is encountered; the corresponding real number r_j is then the probability that x is
labeled 1.
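
Evaluating a probabilistic decision list amounts to a single pass over the list. The Python sketch below illustrates the definition; the function name and the three-entry list over {0,1}^3 are hypothetical examples introduced here.

```python
def evaluate_pdl(pdl, x):
    """Return c(x) for a probabilistic decision list.

    pdl -- list of (f, r) pairs, where f maps an assignment to 0 or 1 and
           r is a probability; the last f should be the constant function 1
    x   -- an assignment, e.g. a tuple of bits
    """
    for f, r in pdl:
        if f(x) == 1:  # first function in list order that evaluates to 1
            return r
    raise ValueError("the last function in the list must be the constant 1")

# Hypothetical list over {0,1}^3.
pdl = [
    (lambda x: x[0] & x[1], 0.9),  # conjunction of the first two variables
    (lambda x: 1 - x[2], 0.2),     # negation of the third variable
    (lambda x: 1, 0.6),            # constant function 1 (default entry)
]

probability = evaluate_pdl(pdl, (1, 1, 0))  # first entry fires
```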

Rivest  does  not  define  decision  lists  with  respect  to  a  general  basis  as  is  done  here.  Rather, 
in  his  definition,  a  decision  list  only  tests  the  values  of  monomials.  That  is,  he  defines  decision 
lists  specifically  with  respect  to  the  basis  consisting  of  all  conjunctions  of  literals.  He  goes 
on to define the class k-DL of decision lists in which each monomial occurring in the list is a
conjunction of k or fewer literals. Thus, this class is over the basis of all monomials of size at
most k. Rivest describes an efficient algorithm for learning the class k-DL, when k is any fixed
constant. 

Below, we describe an efficient algorithm for learning a special class of probabilistic decision
lists over any basis T_n. The running time of this algorithm is polynomial in all of the usual
parameters, in addition to |T_n|, and the maximum time needed to evaluate any function f in
T_n. Thus, in particular, this implies a polynomial-time algorithm for the same basis considered
by Rivest, namely, the set of all conjunctions of k or fewer literals, for k a fixed constant.

Let c be a probabilistic decision list over basis T_n, given by the list (f_1, r_1), ..., (f_s, r_s).
For ω ∈ [0,1], we say that c is a probabilistic decision list with ω-converging probabilities if
|r_i − ω| ≥ |r_{i+1} − ω| for 1 ≤ i < s. Below, we describe an algorithm for inferring such lists when
ω is known. As a special case, when ω = 0, this algorithm can be used to learn probabilistic
decision lists with decreasing probabilities, i.e., lists in which r_i ≥ r_j for i < j.

Perhaps the most natural case occurs when ω = 1/2. In this case, we say that c is a
probabilistic decision list with decreasing certainty since instances with the most certain outcomes
(labels)  are  handled  at  the  beginning  of  the  list.  For  instance,  a  college’s  admissions  process 
(see  Section  4-1)  might  be  naturally  modeled  in  this  manner  as  a  list  of  criteria  for  determining 
admission,  ordered  by  importance:  for  example,  if  the  student  has  straight  A’s,  then  he  should 
be  admitted  with  90%  probability;  otherwise,  if  he  did  poorly  on  his  SAT’s,  then  he  should 
be rejected with 85% probability; otherwise, if he was class president, then he should be
accepted with 75% probability; and so on. Note that the class of probabilistic decision lists with
decreasing  certainty  includes  the  class  of  ordinary  (deterministic)  decision  lists  over  the  same 
basis. 
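
The ω-converging condition is easy to test mechanically. The Python sketch below encodes the admissions example from the preceding paragraph; the function name is hypothetical, and the final default entry with probability 0.5 is an assumption added here, since the text leaves the remainder of the list unspecified.

```python
def is_omega_converging(probs, omega):
    """Check |r_i - omega| >= |r_(i+1) - omega| for consecutive list entries."""
    return all(abs(a - omega) >= abs(b - omega) for a, b in zip(probs, probs[1:]))

# Admission probabilities in list order.
admissions = [
    ("straight A's", 0.90),
    ("poor SAT scores", 0.15),       # rejected with 85% probability
    ("class president", 0.75),
    ("default (constant 1)", 0.50),  # hypothetical final entry, not from the text
]
probs = [r for _, r in admissions]

decreasing_certainty = is_omega_converging(probs, 0.5)      # omega = 1/2
decreasing_probabilities = is_omega_converging(probs, 0.0)  # omega = 0
```

The distances from 1/2 are 0.40, 0.35, 0.25 and 0, so the list has decreasing certainty; it does not have decreasing probabilities, because 0.15 is followed by 0.75.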

Input:   ω ∈ [0,1]
         basis T_n = {f_1, ..., f_s}
         ε, δ, γ > 0
         access to random examples of a probabilistic decision list over basis T_n
         with ω-converging probabilities

Output:  with probability at least 1 − δ, an (ε,γ)-good model of probability

Procedure:

 1  L ← empty list
 2  J ← {1, ..., s}
 3  obtain a sample S of m = ⌈(32s/ε^3γ^2) · ln(2^{s+2}s/δ)⌉ random examples
 4  repeat
 5      if |{(x,b) ∈ S : f_j(x) = 1}| < mε/4s for some j ∈ J then
 6          t ← j
 7          p̂_t ← 0
 8      else
 9          for j ∈ J: p̂_j ← |{(x,b) ∈ S : f_j(x) = 1 ∧ b = 1}| ÷ |{(x,b) ∈ S : f_j(x) = 1}|
10          choose t that maximizes |p̂_t − ω|
11      L ← L, (f_t, p̂_t)
12      S ← {(x,b) ∈ S : f_t(x) = 0}
13      J ← J − {t}
14  until J = ∅
15  output L

Figure 1: An algorithm for learning probabilistic decision lists with ω-converging probabilities.

We also note that the algorithm given below in Theorem 3.2 can be applied to learn ordinary
decision lists when the supplied examples are “noisy.” Specifically, consider the problem of
learning a deterministic decision list c given by the list (f_1, b_1), ..., (f_s, b_s) where each f_i is
in the basis and, since the list is deterministic, each b_i ∈ {0,1}. Suppose further that
the classification of each example is flipped randomly with probability η < 1/2. This random
misclassification noise model is considered, for instance, by Angluin and Laird [9]. Note that the
observed behavior in such a situation can be modeled naturally by the probabilistic decision list
c'  given  by  (/j,  |6,  -  tj|),  \b3  -  tj\).  That  is,  c'(x)  is  the  probability  that  x  is  labeled  1  by 

a  noisy  oracle  for  c.  Clearly,  c'  is  a  probabilistic  decision  list  with  1 /2-converging  probabilities. 
Thus,  we  can  apply  the  efficient  learning  algorithm  for  this  class  (described  below)  to  obtain 
a  good  model  of  probability  h  for  c'.  If  we  choose  7  <  1/2  -  77,  then  it  can  be  seen  that  the 
projection  of  h  is  a  good  approximation  of  c;  that  is.  with  probability  at  least  1  —6.  a  hypothesis 
h  is  obtained  for  which  Prr€jD  [jrA(x)  ^  c(i)j  <  c.  (Technically,  this  algorithm  assumes  that  77, 
or  an  upper  bound  on  tj.  is  known.  However,  if  no  such  bound  is  known,  Angluin  and  Laird  [9] 
give  a  technique  for  finding  a  good  bound  using  a  kind  of  “binary  search.”) 

Thus, a corollary of Theorem 3.2 is a proof that deterministic decision lists are efficiently
learnable even when the supplied examples are randomly misclassified with probability $\eta$. The
running time is then polynomial in $1/(1 - 2\eta)$, in addition to the usual other parameters. This
specifically answers an open question proposed by Rivest [72] concerning the learnability of
decision lists in such a noisy setting.
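
The reduction from the noisy deterministic setting can be sketched in a few lines. The code below maps each deterministic label $b_i$ to the probability $|b_i - \eta|$ that a noisy oracle labels the instance 1, and shows that thresholding an estimate that is within $\gamma < 1/2 - \eta$ of that probability recovers $b_i$. The function names are hypothetical, and taking the projection to be a threshold at 1/2 is an assumption made for this illustration.

```python
from typing import List


def noisy_list_probabilities(bits: List[int], eta: float) -> List[float]:
    """Probability of observing label 1 at each list item under noise rate eta."""
    if not 0 <= eta < 0.5:
        raise ValueError("the noise rate must satisfy 0 <= eta < 1/2")
    # b = 1 is reported as 1 with probability 1 - eta; b = 0 with probability eta.
    return [abs(b - eta) for b in bits]


def project(p: float) -> int:
    """Threshold a probability estimate at 1/2 (assumed form of the projection)."""
    return 1 if p >= 0.5 else 0
```

With $\eta = 0.2$, every item has probability 0.8 or 0.2, so all items lie at distance $1/2 - \eta = 0.3$ from 1/2, and any estimate that is off by less than 0.3 still falls on the correct side of the threshold.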

Theorem 3.2 Let $\omega \in [0, 1]$ be fixed, and let $\mathcal{F}_n$ be a basis of functions. Then the
p-concept class of probabilistic decision lists over basis $\mathcal{F}_n$ with $\omega$-converging
probabilities is learnable with a model of probability (assuming both $\omega$ and $\mathcal{F}_n$ are
known). Specifically, this class can be learned in time polynomial in $1/\epsilon$, $1/\gamma$,
$1/\delta$, $n$, $|\mathcal{F}_n|$ and the maximum time needed to evaluate any function in $\mathcal{F}_n$.

Proof: Our learning algorithm for this p-concept class is shown in Figure 1. As usual, the
algorithm begins by drawing a large sample $S$ of size $m$ which will be used to construct a
hypothesis probabilistic decision list $L$. (Note that $S$ and all subsets derived from $S$ are
multisets: they are “sets” which may contain multiple copies of the same example.)

Assume for convenience that the functions in $\mathcal{F}_n$ are indexed so that the target p-concept
$c$ is given by the list $(f_1, r_1), \ldots, (f_s, r_s)$. (Of course, the learning algorithm is not aware of
this.) We also assume without loss of generality that every function in the basis $\mathcal{F}_n$ occurs in
the target list so that $s = |\mathcal{F}_n|$.

Here is the intuition behind our algorithm: using the sample, we might estimate the probability
$p_i$ that a positive random example $(x, 1)$ is drawn, given that $f_i(x) = 1$. It can be
shown to follow from the definition of $\omega$-converging decision lists that
$|p_1 - \omega| \ge |p_i - \omega|$ for all $i$. This suggests a technique for identifying the first
variable in the list: if our estimates $\hat{p}_i$ are sufficiently accurate, we would expect
$|\hat{p}_i - \omega|$ to be maximized when $i = 1$. This is the approach taken by our algorithm: the
function $f_i$ for which $|\hat{p}_i - \omega|$ is greatest is placed at the head of the hypothesis list.
The remainder of the list is constructed iteratively using the part of the sample on which
$f_i(x) = 0$.
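
The greedy procedure of Figure 1 can be sketched in Python as follows. This is an illustrative transcription, not the thesis's own code: basis functions are passed as Python callables returning 0 or 1, the sample is assumed to have been drawn already, ties are broken by smallest index, and the names `sample_size` and `learn_pdl` are hypothetical. The sketch assumes `eps > 0`, so that any function passing the rarity test has a nonzero count.

```python
import math
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[int], int]


def sample_size(s: int, eps: float, gamma: float, delta: float) -> int:
    """Sample size m from line 3 of Figure 1."""
    return math.ceil((32 * s / (eps ** 3 * gamma ** 2)) * math.log(2 ** (s + 2) * s / delta))


def learn_pdl(sample: List[Example], basis: List[Callable], omega: float,
              eps: float) -> List[Tuple[int, float]]:
    """Return the hypothesis list as (basis index, estimated probability) pairs."""
    s, m = len(basis), len(sample)
    rare_cutoff = m * eps / (4 * s)
    remaining = set(range(s))
    current = list(sample)
    hypothesis = []
    while remaining:
        counts = {j: sum(1 for x, _ in current if basis[j](x) == 1) for j in remaining}
        rare = [j for j in sorted(remaining) if counts[j] < rare_cutoff]
        if rare:
            # Lines 5-7: a rarely satisfied function is appended with estimate 0.
            t, p_t = rare[0], 0.0
        else:
            # Lines 9-10: estimate Pr[b = 1 | f_j(x) = 1] and pick the estimate farthest from omega.
            p_hat = {j: sum(b for x, b in current if basis[j](x) == 1) / counts[j]
                     for j in remaining}
            t = max(sorted(remaining), key=lambda j: abs(p_hat[j] - omega))
            p_t = p_hat[t]
        hypothesis.append((t, p_t))
        # Line 12: keep only the examples that fall through to the rest of the list.
        current = [(x, b) for x, b in current if basis[t](x) == 0]
        remaining.remove(t)
    return hypothesis
```

Recomputing the counts from the shrinking sample at every iteration mirrors line 12 of the figure: each estimate is conditioned on all previously chosen functions being 0.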

For $I \subseteq \{1, \ldots, s\}$ and $j \in \{1, \ldots, s\}$, let $A(I, j)$ be the set of all instances
$x$ for which $f_j(x) = 1$ and $f_i(x) = 0$ for all $i \in I$. Let

$$u(I, j) = \Pr_{x \in D}[x \in A(I, j)]$$

and

$$v(I, j) = \Pr_{(x, b) \in EX}[b = 1 \mid x \in A(I, j)].$$

Also, let $\hat{u}(I, j)$ and $\hat{v}(I, j)$ be empirical estimates of these quantities derivable from
the sample $S$ in the obvious manner.

Let $I \subseteq \{1, \ldots, s\}$ and $j \in \{1, \ldots, s\}$ be fixed. Then, using the multiplicative
form of Chernoff bounds given by Lemma 2-3.6, it follows that if $u(I, j) \ge \epsilon/(2s)$ then, since
$m \ge (16s/\epsilon) \cdot \ln(2^{s+1}s/\delta)$,

$$\hat{u}(I, j) \ge \tfrac{1}{2} \cdot u(I, j)$$

with probability at least $1 - \delta/(s \cdot 2^{s+1})$. Furthermore, if $\hat{u}(I, j) \ge \epsilon/(4s)$,
then the number of instances $x \in A(I, j)$ included in $S$ is at least
$m\epsilon/(4s) \ge (8/\epsilon^2\gamma^2) \cdot \ln(2^{s+2}s/\delta)$. Thus, applying the additive form
of Chernoff bounds, we see that

$$|\hat{v}(I, j) - v(I, j)| \le \epsilon\gamma/4$$

with probability at least $1 - \delta/(2^{s+1}s)$, assuming $\hat{u}(I, j) \ge \epsilon/(4s)$.

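Thus, with probability at least $1 - \delta$, a sample $S$ is chosen such that for all
$I \subseteq \{1, \ldots, s\}$ and for all $j \in \{1, \ldots, s\}$, we have that

$$u(I, j) \le \max\{\epsilon/(2s),\ 2\hat{u}(I, j)\}, \qquad (3.1)$$

and, whenever $\hat{u}(I, j) \ge \epsilon/(4s)$, we also have that

$$|\hat{v}(I, j) - v(I, j)| \le \epsilon\gamma/4. \qquad (3.2)$$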

We assume henceforth that all of the empirical estimates $\hat{u}(I, j)$ and $\hat{v}(I, j)$ satisfy the
conditions described above. As just argued, this will be the case with probability at least
$1 - \delta$. To complete the proof, we show that this assumption implies that the algorithm’s output
hypothesis $h$ is an $(\epsilon, \gamma)$-good model of probability.

Suppose $h$ is given by the list $(f_{t_1}, r'_1), \ldots, (f_{t_s}, r'_s)$. Let $T_i = \{t_1, \ldots, t_i\}$.
To prove that $h$ is an $(\epsilon, \gamma)$-good model of probability, we show that, for
$1 \le i \le s$, either

$$\Pr_{x \in D}[x \in A(T_{i-1}, t_i)] \le \epsilon/(2s) \qquad (3.3)$$

or

$$\Pr_{x \in D}[|h(x) - c(x)| > \gamma \mid x \in A(T_{i-1}, t_i)] \le \epsilon/2. \qquad (3.4)$$

Note that the sets $A(T_{i-1}, t_i)$ are disjoint. Thus, this implies

$$\Pr_{x \in D}[|h(x) - c(x)| > \gamma]
  = \sum_{i=1}^{s} \Pr_{x \in D}[|h(x) - c(x)| > \gamma \mid x \in A(T_{i-1}, t_i)] \cdot \Pr_{x \in D}[x \in A(T_{i-1}, t_i)]
  \le \epsilon$$

as can be seen by breaking the sum into two parts based on whether $\Pr_{x \in D}[x \in A(T_{i-1}, t_i)]$
exceeds or does not exceed $\epsilon/(2s)$.

Fix $i$, and consider the $i$th iteration of our algorithm. Prior to the extension of $L$ at line 11,
the hypothesis list is $(f_{t_1}, r'_1), \ldots, (f_{t_{i-1}}, r'_{i-1})$. Let $C_j = A(T_{i-1}, j)$. Also,
let $p_j = v(T_{i-1}, j)$, and observe that, as defined in the figure, $\hat{p}_j = \hat{v}(T_{i-1}, j)$.
This follows from the fact that, at this point in the execution of the algorithm, all examples
$(x, b)$ in $S$ are such that $f_k(x) = 0$ for $k \in T_{i-1}$.

Let $t$ be as in the figure (i.e., $t = t_i$). If $t$ was chosen at line 6, then
$\hat{u}(T_{i-1}, t) < \epsilon/(4s)$, and so $u(T_{i-1}, t) \le \epsilon/(2s)$ by equation (3.1). Thus, in
this case, equation (3.3) holds by definition of $u(I, j)$.

Otherwise, for all $j \in J$, $\hat{u}(T_{i-1}, j) \ge \epsilon/(4s)$, and thus
$|\hat{p}_j - p_j| \le \epsilon\gamma/4$ by equation (3.2). We wish to prove that equation (3.4) holds in
this case, i.e., that

$$\Pr_{x \in D}[|\hat{p}_t - c(x)| > \gamma \mid x \in C_t] \le \epsilon/2.$$

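Let $u$ be the smallest member of $J$. Then $p_u = r_u$ by definition of decision lists. Also, since
$c$ is given by a list with $\omega$-converging probabilities, $|r_u - \omega| \ge |r_j - \omega|$ for
$j \ge u$. Thus, by our choice of $t$, for $j \in J$,

$$|r_j - \omega| \le |r_u - \omega| = |p_u - \omega| \le |\hat{p}_u - \omega| + \epsilon\gamma/4 \le |\hat{p}_t - \omega| + \epsilon\gamma/4.$$

Suppose $\hat{p}_t \ge \omega$. Then clearly $r_j \le \hat{p}_t + \epsilon\gamma/4$ for $j \in J$, and thus
$c(x) \le \hat{p}_t + \epsilon\gamma/4 < \hat{p}_t + \gamma$ whenever $x \in C_t$. Let $z$ be the
probability that an $x$ is chosen for which $c(x) < \hat{p}_t - \gamma$, given that $x$ is in $C_t$:

$$z = \Pr_{x \in D}[c(x) < \hat{p}_t - \gamma \mid x \in C_t].$$

Then

$$\begin{aligned}
p_t &= \mathbf{E}_{x \in D}[c(x) \mid x \in C_t] \\
    &\le z(\hat{p}_t - \gamma) + (1 - z)(\hat{p}_t + \epsilon\gamma/4) \\
    &\le z(p_t + \epsilon\gamma/4 - \gamma) + (1 - z)(p_t + \epsilon\gamma/2) \\
    &\le p_t + \epsilon\gamma/2 - \gamma z.
\end{aligned}$$

This implies $z \le \epsilon/2$, and so (3.4) holds in this case. The proof of (3.4) is symmetric when
$\hat{p}_t < \omega$.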

The  algorithm  of  Figure  1  clearly  runs  in  polynomial  time.  ■ 

It is an open question whether this class is learnable when $\omega$ is unknown.

The class of probabilistic decision lists has also been considered by Yamanishi [88]. He
describes an algorithm, based on the principle of minimum description length, for learning a
model of probability for p-concepts in this class; however, his algorithm is not computationally
efficient. Also, Aiello and Mihail [1] have recently described an efficient algorithm for learning
arbitrary probabilistic decision lists over the basis consisting of all literals in the special case
that $D$ is the uniform distribution.

4-3.3 Hidden-variable problems

We next consider p-concept classes motivated by hidden-variable problems, in which there is an
underlying deterministic concept, but the settings of some of the relevant variables are invisible
to the learning algorithm, resulting in apparent probabilistic behavior. A visible monomial
p-concept is defined over $\{0, 1\}^n$ by a pair $(M, \alpha)$, where $M$ is a monomial over the visible
Boolean variables $x_1, \ldots, x_n$ and $\alpha \in [0, 1]$. The associated p-concept $c$ is defined for
$x \in \{0, 1\}^n$ to be $c(x) = \alpha \cdot M(x)$. We conceptually regard the true deterministic concept
as having the form $M \wedge I$, where $I$ is a deterministic concept over the hidden variables. We
interpret $\alpha$ as the probability that the settings of the invisible variables satisfy $I$. Note that
we assume independence between the settings for the variables of $M$ and those for $I$.

For instance, $I$ might itself be a monomial, in which case the underlying target concept is
a conjunction of literals, some of which are visible and some of which are hidden.

Visible  monomials  model  well  situations  in  which  certain  observable  conditions  are  requisite 
to  some  outcome,  but  in  which  these  conditions  are  not  in  themselves  enough  to  determine  the 
outcome  with  certainty.  Thus,  the  conditions  are  necessary,  but  not  sufficient,  and,  when  the 
conditions  are  met,  the  final  outcome  may  be  uncertain.  For  instance,  if  you  are  handed  a  drink 
that  is  brown  and  fizzes  and  tastes  sweet,  then  the  drink  might  be  Coke;  on  the  other  hand,  it 
might  not  be  Coke  (it  could  be  Pepsi).  In  any  case,  if  the  drink  lacks  any  one  of  these  qualities, 
then  it  certainly  cannot  be  the  real  thing. 

Theorem  3.3  The  class  of  visible  monomial  p-concepts  is  polynomially  learnable  with  a  model 
of  probability. 

Proof: Let the target p-concept $c$ be defined by the pair $(M, \alpha)$, and let the target distribution
over $\{0, 1\}^n$ be $D$. We describe an algorithm that, given $\epsilon, \delta > 0$, outputs with
probability at least $1 - \delta$ a hypothesis $h$ for which
$\mathbf{E}_{x \in D}[|h(x) - c(x)|] \le \epsilon$; Lemma 2.2 implies that such an algorithm can be
converted into one that learns a good model of probability.

The first step of the learning algorithm is to obtain an estimate $\hat{p}$ of
$p = \Pr_{(x, b) \in EX}[b = 1]$ that, with probability at least $1 - \delta/3$, is such that
$|p - \hat{p}| \le \epsilon/3$. If $\hat{p} < 2\epsilon/3$, then the algorithm outputs the hypothesis
$h(x) = 0$. Assuming $\hat{p}$ has the desired accuracy, we have
$\mathbf{E}_{x \in D}[|c(x) - h(x)|] \le \epsilon$ in this case as desired, since
$p = \mathbf{E}_{x \in D}[c(x)] \le \epsilon$. Otherwise, $\hat{p} \ge 2\epsilon/3$, and we can assume
henceforth that $p \ge \epsilon/3$ (as is the case with probability at least $1 - \delta/3$).

Next our algorithm attempts to learn a good approximation of $M$. This is done using
Valiant’s algorithm [83], here denoted $V$, for learning monomials from positive examples only
in the distribution-free deterministic model. Algorithm $V$, which we here use as a “black-box”
subroutine, has the following properties: the algorithm takes as input positive $\epsilon$ and $\delta$,
and a source of positive examples of some monomial $M$, each chosen randomly according to some
fixed, arbitrary distribution $D^+$ on the set of all positive examples. After running for time
polynomial in $1/\epsilon$, $1/\delta$ and $n$, $V$ outputs a monomial $\hat{M}$ that, with high
probability, has error at most $\epsilon$ for the positive examples of $M$, and has zero error for the
negative examples. That is, with probability at least $1 - \delta$, $\hat{M}$ is such that:

$$\Pr_{x \in D^+}[\hat{M}(x) = 0] \le \epsilon,$$

and also

$$M(x) = 0 \Rightarrow \hat{M}(x) = 0.$$

Our algorithm simulates $V$ with $V$’s parameter $\epsilon$ set to $\epsilon/4$, and $\delta$ set to
$\delta/3$. We provide $V$ with a simulated oracle $EX'$ which supplies $V$ with only positively
labeled examples. Specifically, when $V$ requests an example, $EX'$ draws examples from $EX$ until
an example $(x, 1)$ is received; this instance $x$ is then provided to $V$.

Note that if $x$ is labeled positively by $EX$, then $c(x) > 0$ and so $M(x) = 1$. Thus, $V$ is only
supplied with positive examples. Note also that the probability of drawing a positively labeled
example from $EX$ equals $p$. Since $p \ge \epsilon/3$, it follows that the expected running time of
$EX'$ is at most $O(1/\epsilon)$.
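
The simulated oracle $EX'$ is a rejection sampler over $EX$, which can be sketched as follows. The toy oracle `make_ex` is purely illustrative: it assumes a uniform distribution over $\{0, 1\}^n$ and a monomial given by the indices of its (unnegated) variables, neither of which is required by the proof; both function names are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Instance = Tuple[int, ...]


def make_ex(alpha: float, monomial: List[int], n: int,
            rng: random.Random) -> Callable[[], Tuple[Instance, int]]:
    """Toy oracle EX for the visible monomial p-concept c(x) = alpha * M(x)."""
    def ex() -> Tuple[Instance, int]:
        x = tuple(rng.randint(0, 1) for _ in range(n))
        c_x = alpha if all(x[i] == 1 for i in monomial) else 0.0
        label = 1 if rng.random() < c_x else 0
        return x, label
    return ex


def make_ex_prime(ex: Callable[[], Tuple[Instance, int]]) -> Callable[[], Instance]:
    """EX': draw from EX until a positively labeled example appears; return its instance."""
    def ex_prime() -> Instance:
        while True:
            x, label = ex()
            if label == 1:
                return x
    return ex_prime
```

Every instance returned by `ex_prime` satisfies the monomial, because an instance with $M(x) = 0$ has $c(x) = 0$ and is never labeled 1. The expected number of draws per call is $1/p$.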

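The probability that $EX'$ outputs some instance $x$ is just

$$\begin{aligned}
D^+(x) &= \Pr_{(y, b) \in EX}[y = x \mid b = 1] \\
       &= \frac{\Pr_{(y, b) \in EX}[y = x \wedge b = 1]}{\Pr_{(y, b) \in EX}[b = 1]} \\
       &= \frac{\alpha M(x) \cdot D(x)}{\alpha \cdot \Pr_{y \in D}[M(y) = 1]} \\
       &= \frac{M(x) D(x)}{\Pr_{y \in D}[M(y) = 1]} \\
       &= \Pr_{y \in D}[y = x \mid M(y) = 1].
\end{aligned}$$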

With probability at least $1 - \delta/3$, $V$ outputs a hypothesis $\hat{M}$ which is such that
$\hat{M}(x) = 0$ whenever $M(x) = 0$, and

$$\Pr_{x \in D^+}[\hat{M}(x) = 0] \le \epsilon/4.$$

Our algorithm next obtains an estimate $\hat{\alpha}$ of
$\alpha' = \Pr_{(x, b) \in EX}[b = 1 \mid \hat{M}(x) = 1]$ that, with probability at least
$1 - \delta/3$, is such that $|\alpha' - \hat{\alpha}| \le \epsilon/2$. Such an estimate can be derived
from a polynomial-size sample since

The algorithm outputs the hypothesis $h$ defined by $(\hat{M}, \hat{\alpha})$; we argue next that $h$
is, with probability at least $1 - \delta$, within $\epsilon$ of $c$ on average.

As noted above, $\hat{M}$ has the property that

$$\Pr_{x \in D}[\hat{M}(x) = 0 \mid M(x) = 1] = \Pr_{x \in D^+}[\hat{M}(x) = 0] \le \epsilon/4.$$

Also, $\hat{M}$ logically implies $M$. Since

$$\begin{aligned}
\alpha' &= \Pr_{(x, b) \in EX}[b = 1 \mid \hat{M}(x) = 1] \\
        &= \Pr_{(x, b) \in EX}[b = 1 \mid M(x) = 1] \cdot \Pr_{x \in D}[\hat{M}(x) = 1 \mid M(x) = 1],
\end{aligned}$$

it follows that $\alpha \ge \alpha' \ge \alpha(1 - \epsilon/4) \ge \alpha - \epsilon/4$, and so
$|\alpha - \hat{\alpha}| \le 3\epsilon/4$. Thus, again making use of the fact that $\hat{M}$ has
one-sided error, it can be seen that

$$\begin{aligned}
\mathbf{E}_{x \in D}[|h(x) - c(x)|]
  &\le \Pr_{x \in D}[\hat{M}(x) = 0 \wedge M(x) = 1] + |\alpha - \hat{\alpha}| \cdot \Pr_{x \in D}[\hat{M}(x) = 1] \\
  &\le \Pr_{x \in D}[\hat{M}(x) = 0 \mid M(x) = 1] + |\alpha - \hat{\alpha}| \\
  &\le \epsilon.
\end{aligned}$$


Finally, we remark that the algorithm described in this proof can be easily extended to
learn any p-concept $c$ of the form $c = \alpha c_0$ where $\alpha$ is an unknown constant in $[0, 1]$
and $c_0$ is a deterministic concept from some known concept class for which there exists an
efficient algorithm that, like the algorithm $V$ described in the proof, requires positive examples
only, and outputs hypotheses with one-sided error on the positive-examples distribution only. For
instance, Valiant [83] describes such an algorithm for learning $k$-CNF (the class of Boolean
formulas consisting of a conjunction of clauses, each a disjunction of at most $k$ literals).


4-4  Hypothesis  testing  and  expected  loss 

In this section, we address the problem of hypothesis testing in the p-concept model. More
precisely, given a labeled sample and a hypothesis p-concept $h$, how do we decide how good $h$ is
with respect to the sample? As will be seen, the answer to this question depends on what our
goal is (a decision rule or a model of probability).

We begin with a description of the learning framework that was proposed by Haussler [36],
and that extends the work of Pollard [71], Dudley [23], Vapnik [84] and others. In this framework,
the learner observes pairs $(x, y)$ drawn randomly from some product space $X \times Y_0$
according to some fixed distribution. For instance, in the p-concept model, $X$ is the domain, and
$Y_0 = \{0, 1\}$; the target distribution on $X$ and the target p-concept together induce a
distribution on the space $X \times Y_0$.

Roughly speaking, in Haussler’s model, the learner tries to find a hypothesis that accurately
predicts the $y$-value of a random pair $(x, y)$, given only the observed $x$-value. Thus, the
hypothesis $h$ should be such that $h(x)$ is “near” $y$ for most random pairs $(x, y)$.

It is often convenient not to restrict the range of $h$ to the set $Y_0$; for instance, if
$Y_0 = \{0, 1\}$, then we may want to allow $h$ to map into $[0, 1]$. In general, then, we assume that
$h$ is a function which maps $X$ into some set $Y \supseteq Y_0$.

In Haussler’s model, the learner must choose a hypothesis from some given hypothesis space
$\mathcal{H}$ of functions (each mapping $X$ into $Y$). The goal of the learner is to find the
hypothesis from $\mathcal{H}$ that minimizes the “discrepancy” on random pairs $(x, y)$ between the
observed value $y$, and the predicted value $h(x)$. This discrepancy between $y$ and $h(x)$ is
measured by a real-valued “loss” function. Formally, a loss function $L$ is a function mapping
$Y \times Y_0$ into $[0, 1]$. (The extension of such results to general bounded functions is
straightforward.) Thus, the formal goal of the learner in this framework is to find a function
$h \in \mathcal{H}$ that minimizes the average loss $\mathbf{E}[L(h(x), y)]$, where the expectation is
over points $(x, y)$ drawn randomly from $X \times Y_0$ according to the distribution on this product
space.

Following Haussler [36], we adopt the notation $L_h(x, y) = L(h(x), y)$ for loss function $L$ and
hypothesis $h$. Moreover, we will write $\mathbf{E}[L_h]$ to denote the expected loss of $h$ (with
respect to $L$) under the unknown distribution on $X \times Y_0$. For a given sample
$S = ((x_1, y_1), \ldots, (x_m, y_m))$ of $m$ labeled examples, we will also be interested in the
empirical loss of $h$:

$$\hat{\mathbf{E}}_S[L_h] = \frac{1}{m} \sum_{i=1}^{m} L_h(x_i, y_i).$$

Note that the empirical loss does not depend on the underlying distribution. Also, when the
sample is clear from context, the subscript $S$ is usually dropped.

We can cast the problems of learning decision rules and models of probability into this
general framework. As mentioned above, in our setting $Y_0 = \{0, 1\}$ since an algorithm only sees
$\{0, 1\}$-labels. For decision-rule learning, the algorithm outputs $\{0, 1\}$-valued hypotheses, and
thus $Y = Y_0 = \{0, 1\}$ in this case. Similarly, for model-of-probability learning, we assume that
hypotheses have range $[0, 1]$, and so $Y = [0, 1]$. The distribution on $X \times Y_0$ is naturally
determined by the joint behavior of the target distribution $D$ on $X$ and the conditional
probabilities $c(x)$ given by the target p-concept.

For finding the best decision rule, the discrete loss function is most appropriate, that is, the
loss function $Z$ given by the rule $Z(y, y') = 0$ if $y = y'$ and 1 otherwise. Then $\mathbf{E}[Z_h]$
is just the probability that $h$ will misclassify a randomly drawn point, so minimizing
$\mathbf{E}[Z_h]$ is equivalent to minimizing the predictive error.

For finding a model of probability, the quadratic loss function $Q(y, y') = (y - y')^2$ has some
nice properties which follow from the following theorem, and that make it the appropriate
choice. Note that the empirical loss $\hat{\mathbf{E}}[Q_h]$ is the average squared-error statistic that
is well known to researchers in pattern recognition and statistical decision theory.
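
For concreteness, the empirical loss of a hypothesis under the discrete and quadratic loss functions can be computed as in the sketch below. The function names are hypothetical; a hypothesis is any Python callable and a sample is a list of `(x, y)` pairs with `y` in {0, 1}.

```python
from typing import Callable, List, Tuple


def discrete_loss(prediction: float, y: int) -> float:
    """Z(y, y') = 0 if the prediction equals the label, and 1 otherwise."""
    return 0.0 if prediction == y else 1.0


def quadratic_loss(prediction: float, y: int) -> float:
    """Q(y, y') = (y - y')^2."""
    return (prediction - y) ** 2


def empirical_loss(loss: Callable[[float, int], float], h: Callable,
                   sample: List[Tuple[object, int]]) -> float:
    """Average of loss(h(x), y) over the labeled sample."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)
```

For example, on the sample `[(0, 0), (1, 1), (2, 1), (3, 0)]` the rule "predict 1 when x >= 1" has empirical discrete loss 1/4, and the constant hypothesis 0.5 has empirical quadratic loss 1/4.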

Theorem 4.1 For any target p-concept $c$, target distribution $D$, and p-concept $h$,

$$\mathbf{E}[Q_h] - \mathbf{E}[Q_c] = \mathbf{E}_{x \in D}[(h(x) - c(x))^2].$$

Proof: For fixed $x \in X$, the probability that $x$ is labeled 1 is $c(x)$, and in this case, $h$ has
loss

$$Q_h(x, 1) = Q(h(x), 1) = (1 - h(x))^2.$$

Likewise, $x$ is labeled 0 with probability $1 - c(x)$, and in this case, $h$ has loss $(h(x))^2$. Thus,

$$\mathbf{E}[Q_h] = \int_X \left[ c(x)(1 - h(x))^2 + (1 - c(x)) h(x)^2 \right] dD(x).$$

Similarly,

$$\mathbf{E}[Q_c] = \int_X \left[ c(x)(1 - c(x))^2 + (1 - c(x)) c(x)^2 \right] dD(x).$$

Applying straightforward algebra and linearity of integrals, it follows that

$$\begin{aligned}
\mathbf{E}[Q_h] - \mathbf{E}[Q_c] &= \int_X [h(x) - c(x)]^2 \, dD(x) \\
                                  &= \mathbf{E}_{x \in D}[(h(x) - c(x))^2]
\end{aligned}$$

as desired.

(All these integrals are defined, assuming as usual that $c$ and $h$ are measurable.) ■
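
The algebra behind the proof of Theorem 4.1 can be checked pointwise. Writing $h = h(x)$ and $c = c(x)$ for a fixed $x$ (a shorthand introduced here only for this check), the two integrands and their difference expand as follows:

```latex
\begin{align*}
c(1-h)^2 + (1-c)h^2 &= c - 2ch + ch^2 + h^2 - ch^2 = h^2 - 2ch + c, \\
c(1-c)^2 + (1-c)c^2 &= c(1-c)\bigl[(1-c) + c\bigr] = c - c^2, \\
(h^2 - 2ch + c) - (c - c^2) &= h^2 - 2ch + c^2 = (h - c)^2 .
\end{align*}
```

Integrating the last line with respect to $D$ gives the identity stated in the theorem.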

Combined with Lemma 2.2, this theorem immediately suggests a computationally efficient
method of choosing a good model of probability from a small (polynomial-size) class of candidate
hypotheses. Suppose that a learning algorithm $A$ has done some initial sampling and
computation and has produced a class $\mathcal{H}$ of hypotheses, one of which is a good model of
probability. Then $A$ may simply use the empirical loss $\hat{\mathbf{E}}[Q_h]$ on a large enough
labeled sample (a second sample) as an accurate estimate of the true loss $\mathbf{E}[Q_h]$ for each
$h \in \mathcal{H}$, and then output the hypothesis with the smallest empirical loss. This hypothesis
$h$ must have near minimal true loss, and so, by the preceding theorem and our assumption that
$\mathcal{H}$ contains a good model of probability, $h$ must itself be a good model of probability. The
remainder of this section describes an example of an efficient learning algorithm that employs this
approach.

4-4.1 Probabilistic concepts of k relevant variables

For a p-concept $c$ on $n$ Boolean variables, we say that variable $x_i$ is relevant if
$c(x) \neq c(y)$ for two vectors $x$ and $y$ which differ only in their $i$th bit. We say that $c$ is a
p-concept of $k$ relevant variables if $c$ has only $k$ relevant variables. Such p-concepts are good
models of situations in which there are a small number of variables whose settings determine the
probabilistic behavior in a possibly very complicated manner, but most variables have no influence
on this behavior.

Theorem 4.2 Let $k \ge 1$ be fixed. Then the class of all p-concepts of $k$ relevant variables is
polynomially learnable with a model of probability.

Proof: For any set $I \subseteq \{1, \ldots, n\}$, we say that two assignments $x$ and $y$ in
$\{0, 1\}^n$ are equivalent with respect to $I$ if $x_i = y_i$ for all $i \in I$. Then this equivalence
relation partitions $\{0, 1\}^n$ into $2^{|I|}$ equivalence classes, called $I$-blocks.

Let $c$ be the target p-concept, and let $I_*$ be the set of indices of the $k$ relevant variables of
$c$. Our algorithm begins by drawing a sample $S_1$ of size
$m_1 = O((2^k/\epsilon^3) \cdot \log(2^k/\delta))$. For each of the $\binom{n}{k}$ sets $I$ of $k$
indices, and for each $I$-block $B$, our algorithm obtains from $S_1$ an estimate $\hat{p}_B$ of
$p_B = \Pr_{(x, b) \in EX}[b = 1 \mid x \in B]$. A hypothesis $h_I$ is then defined by the rule
$h_I(x) = \hat{p}_B$ for $x \in B$.

By our choice of $m_1$, it follows from Chernoff bounds (Lemma 2-3.6) that, with probability
at least $1 - \delta/2$, a sample $S_1$ is chosen for which $|p_B - \hat{p}_B| \le \epsilon/4$ for every
$I_*$-block $B$ which satisfies $\Pr_{x \in D}[x \in B] \ge \epsilon/2^{k+2}$. This implies that, with
high probability,

$$\mathbf{E}_{x \in D}[|h_{I_*}(x) - c(x)|] = \sum_{B} \Pr_{x \in D}[x \in B] \cdot |\hat{p}_B - p_B| \le \epsilon/2$$

where the sum is taken over all $I_*$-blocks $B$. This bound follows from the fact that $c(x) = p_B$
for $x \in B$, and by breaking the sum into two parts according to whether
$\Pr_{x \in D}[x \in B]$ exceeds or does not exceed $\epsilon/2^{k+2}$.

Next, our algorithm tests each hypothesis $h_I$; that is, an estimate $\hat{\mathbf{E}}[Q_{h_I}]$ is
found from a sufficiently large sample $S_2$ that, with high probability, is within $\epsilon/4$ of
$\mathbf{E}[Q_{h_I}]$. Specifically, this will be the case with probability at least $1 - \delta/2$ for
all hypotheses $h_I$ if we choose a sample $S_2$ of size $O((1/\epsilon^2) \cdot \log(n^k/\delta))$.
The algorithm outputs the hypothesis $h = h_I$ with the minimum empirical loss. Then, applying
Theorem 4.1, we have:

$$\begin{aligned}
\mathbf{E}_{x \in D}[(h(x) - c(x))^2] &= \mathbf{E}[Q_h] - \mathbf{E}[Q_c] \\
  &\le \hat{\mathbf{E}}[Q_h] - \mathbf{E}[Q_c] + \epsilon/4 \\
  &\le \hat{\mathbf{E}}[Q_{h_{I_*}}] - \mathbf{E}[Q_c] + \epsilon/4 \\
  &\le \mathbf{E}[Q_{h_{I_*}}] - \mathbf{E}[Q_c] + \epsilon/2 \\
  &= \mathbf{E}_{x \in D}[(h_{I_*}(x) - c(x))^2] + \epsilon/2 \le \epsilon.
\end{aligned}$$

Applying  Lemma  2.2,  it  follows  that  this  efficient  algorithm  can  be  used  to  learn  a  good  model 
of  probability.  ■ 
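
The two-stage algorithm from this proof (block estimates from a first sample, then selection by empirical quadratic loss on a second sample) can be sketched as below. The sketch is illustrative only: variables are indexed from 0, blocks that never appear in the first sample are given the estimate 0, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[int], int]


def block_hypothesis(sample: List[Example], index_set: Tuple[int, ...]) -> Callable:
    """Build h_I: each I-block gets the fraction of positive labels seen in it."""
    positives, totals = defaultdict(int), defaultdict(int)
    for x, b in sample:
        block = tuple(x[i] for i in index_set)
        totals[block] += 1
        positives[block] += b
    estimate = {block: positives[block] / totals[block] for block in totals}
    # Blocks unseen in the sample default to 0 (an arbitrary choice for this sketch).
    return lambda x: estimate.get(tuple(x[i] for i in index_set), 0.0)


def learn_k_relevant(s1: List[Example], s2: List[Example], n: int, k: int):
    """Return (I, h_I) minimizing the empirical quadratic loss on s2."""
    best = None
    for index_set in combinations(range(n), k):
        h = block_hypothesis(s1, index_set)
        loss = sum((h(x) - b) ** 2 for x, b in s2) / len(s2)
        if best is None or loss < best[0]:
            best = (loss, index_set, h)
    return best[1], best[2]
```

The loop visits all $\binom{n}{k}$ index sets, which is polynomial in $n$ only because $k$ is a fixed constant.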


4-5  Uniform  convergence  methods 

When is minimization of the empirical loss over a hypothesis class $\mathcal{H}$ sufficient to ensure
good learning of a decision rule or a model of probability? Note that even with computational issues
set aside, the hypothesis-testing methods of the preceding section fall apart in the case of an
infinite class $\mathcal{H}$: directly estimating the empirical loss of each $h \in \mathcal{H}$
separately would take an infinite number of examples and an infinite amount of time. What is
required is a characterization of the number of examples required for uniform convergence of
empirical losses to expected losses analogous to that provided by the VC-dimension in the case of
deterministic concepts. This is particularly pressing in our model of p-concepts, where even when
the domain is finite (e.g., $\{0, 1\}^n$), the target p-concept class is usually infinite due to the
different values allowed for the probabilities. We now turn to a discussion of such uniform
convergence techniques applicable to p-concept classes.

Haussler [36], Pollard [71] and others have described the combinatorial dimension of a class
of real-valued functions $\mathcal{F}$, and have shown that the combinatorial dimension is a powerful
tool for obtaining uniform convergence results. Before defining the combinatorial dimension, we
state its most important property for us, namely, that it allows us to upper bound the size of
a sample sufficient to guarantee uniform convergence of empirical estimates for the entire class
of functions. The following theorem is adapted directly from Haussler’s Corollary 2 [35].

For a hypothesis space $\mathcal{H}$ and loss function $L$, we define
$L_{\mathcal{H}} = \{L_h : h \in \mathcal{H}\}$.


Theorem 5.1 Let $\mathcal{H}$ be a hypothesis space of functions mapping $X$ into $Y$ which satisfies
certain “permissibility” assumptions (see Haussler’s paper). Let $D$ be a probability distribution
on $X \times Y_0$, let $L : Y \times Y_0 \to [0, 1]$ be a loss function, let $d < \infty$ be the
combinatorial dimension of $L_{\mathcal{H}}$, and let $S$ be a sample of $m$ points from
$X \times Y_0$ chosen randomly according to $D$. Assume

$$m \ge m(d, \epsilon, \delta) = \frac{64}{\epsilon^2} \left( 2d \ln \frac{16e}{\epsilon} + \ln \frac{8}{\delta} \right).$$

Then

$$\Pr\left[ \exists h \in \mathcal{H} : |\hat{\mathbf{E}}[L_h] - \mathbf{E}[L_h]| > \epsilon \right] \le \delta,$$


where the probability is taken over the random generation of $S$ according to $D$.

Theorem 5.1 suggests the following canonical algorithm for finding a hypothesis from
$\mathcal{H}$ with near minimum loss, when the combinatorial dimension $d$ is finite: take a sample
$S$ of at least $m(d, \epsilon/2, \delta)$ labeled examples from the oracle $EX$, and output any
$h \in \mathcal{H}$ that minimizes the empirical loss $\hat{\mathbf{E}}[L_h]$ with respect to $S$. Then
the theorem guarantees that the output hypothesis has true loss within $\epsilon$ of the best possible
with probability at least $1 - \delta$. This of course ignores the computational problem of actually
finding such a hypothesis.

We can apply Theorem 5.1 to our learning problems by determining what the combinatorial
dimension is for each of the loss functions $Z$ and $Q$. For the loss function $Z$, Haussler points
out that the combinatorial dimension is just the VC-dimension of the hypothesis class. That is,
if $\mathcal{H}$ is a hypothesis space of functions with range $\{0, 1\}$, then the combinatorial
dimension of the set of functions $Z_{\mathcal{H}}$ is just the VC-dimension of $\mathcal{H}$. Thus, the
number of examples needed for decision-rule learning is bounded by the VC-dimension of the space of
hypotheses used by the learning algorithm. (That the VC-dimension can be used in this manner was
also observed by Blumer et al. [14].)

For example, consider the problem of learning a decision rule for an increasing function
over $\mathbb{R}$. Note that the best decision rule for such a p-concept is always of the form
$h_a(x) = 1$ for $x \ge a$, and 0 otherwise, for some $a$. Thus, a natural and efficient decision-rule
learning algorithm for this problem is the following: draw a “large” sample from $EX$. Then, for each
$x_i$ in the sample, determine the empirical predictive error of hypothesis $h_{x_i}$, that is, the
fraction of points in the sample whose labels disagree with $h_{x_i}$. Finally, output the $h_{x_i}$
with the minimum empirical predictive error. Since the VC-dimension of this class of decision rules
is 1, it follows from Theorem 5.1 that a polynomial-size sample suffices to ensure the correctness of
this algorithm.
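
The threshold-selection procedure just described can be sketched as follows. This is a direct quadratic-time illustration (sorting the sample would give a faster version); it assumes the rule predicts 1 when $x \ge a$, breaks ties in favor of the first candidate encountered, and uses the hypothetical name `learn_threshold`.

```python
from typing import List, Tuple


def learn_threshold(sample: List[Tuple[float, int]]) -> Tuple[float, float]:
    """Return (a, error): the sample point a whose rule h_a(x) = [x >= a]
    has the smallest empirical predictive error on the sample."""
    best_a, best_error = None, float("inf")
    for a, _ in sample:
        mistakes = sum(1 for x, y in sample if (1 if x >= a else 0) != y)
        error = mistakes / len(sample)
        if error < best_error:
            best_a, best_error = a, error
    return best_a, best_error
```

Only thresholds at sample points need to be tried, since moving $a$ between two adjacent sample points does not change any prediction on the sample.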

For the problem of learning a model of probability, we introduce the quadratic loss dimension.
Let H be a class of p-concepts over domain X. Let $T = \{(x_1, r_1), \ldots, (x_d, r_d)\}$ be a set
of d pairs, where each $x_i \in X$ and each $r_i \in [0, 1]$. We say that H shatters T if for every string
$v \in \{0, 1\}^d$ there is a p-concept $h \in H$ such that for $1 \le i \le d$, if $v_i = 0$ then $h(x_i) < r_i$ and
if $v_i = 1$ then $h(x_i) > r_i$. Thus on the points $x_1, \ldots, x_d$ the class H exhibits all $2^d$ possible
“above-below” behaviors with respect to the $r_i$. A geometric interpretation of this definition
is to regard $(r_1, \ldots, r_d)$ as the origin of a coordinate system in d-dimensional Euclidean space;
then H shatters T if the set $\{(h(x_1), \ldots, h(x_d)) : h \in H\}$ intersects all $2^d$ orthants of the coordinate
system. For this reason we will sometimes refer to $(r_1, \ldots, r_d)$ as the origin of shattering.
The quadratic loss dimension of a p-concept class H, denoted QD(H), is defined as the largest
value of d such that there is some set T of d pairs that is shattered by H; if no such d exists,
then QD(H) is infinite.
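
For a finite collection of hypotheses, the shattering condition can be checked by brute force. The sketch below is illustrative only: the helper `line` and the small example classes are our own assumptions, and the strict comparisons mirror the definition as printed above.

```python
def shatters(hypotheses, pairs):
    """True if the finite list of hypotheses shatters the pairs (x_i, r_i),
    i.e. realizes all 2^d above/below patterns around the r_i."""
    realized = set()
    for h in hypotheses:
        values = [h(x) for x, _ in pairs]
        # A hypothesis lying exactly on some r_i realizes no pattern.
        if any(v == r for v, (_, r) in zip(values, pairs)):
            continue
        realized.add(tuple(1 if v > r else 0 for v, (_, r) in zip(values, pairs)))
    return len(realized) == 2 ** len(pairs)

def line(a, b):
    """Hypothetical helper: the function x -> a + b * x."""
    return lambda x: a + b * x

constants = [line(0.2, 0.0), line(0.8, 0.0)]
lines = constants + [line(0.2, 0.6), line(0.8, -0.6)]
origin = [(0, 0.5), (1, 0.5)]   # two points, both with r_i = 1/2

print(shatters(constants, origin[:1]))  # one point: both patterns occur
print(shatters(constants, origin))      # two points: constants cannot disagree
print(shatters(lines, origin))          # adding sloped lines gives all four patterns
```

Two constant functions shatter a single pair but not two pairs, since a constant is on the same side of 1/2 at both points; adding two sloped lines supplies the missing "mixed" patterns.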

The quantity QD(H) is in fact just the combinatorial dimension of the class H; but since
we evaluate hypotheses via the quadratic loss, we are more concerned with the combinatorial
dimension of the associated class $Q_H$ of loss functions. The following theorem states that the
combinatorial dimension of $Q_H$ is also equal to QD(H). Despite these equivalences, we choose
to use the notation QD(H) to emphasize that in more general settings, the combinatorial
dimension of the class of loss functions may be quite different from that of the underlying
hypothesis class H.

Theorem 5.2 For any p-concept class H, the combinatorial dimension of $Q_H$ is equal to the
quadratic loss dimension of H.

Proof: Let $\{(x_i, r_i)\}_{i=1}^{d}$ be shattered by H. For all $v \in \{0, 1\}^d$, there exists $h \in H$ such that

$$\mathrm{sign}(r_i - h(x_i)) = v_i,$$

where sign(y) = 1 if y > 0 and sign(y) = 0 if y < 0. Since all quantities are non-negative,
$h(x_i) < r_i$ if and only if $Q_h(x_i, 0) = (h(x_i))^2 < r_i^2$. Thus,

$$v_i = \mathrm{sign}(r_i^2 - Q_h(x_i, 0)),$$

and so $\{((x_i, 0), r_i^2)\}_{i=1}^{d}$ is shattered by $Q_H$. Thus, the combinatorial dimension of $Q_H$ is at least
QD(H).

Conversely, let $\{((x_i, b_i), r_i)\}_{i=1}^{d}$ be shattered by $Q_H$. Since d is finite, we can assume without loss of
generality that the $r_i$'s are chosen so that strict inequality holds in the definition of combinatorial
dimension, i.e., for all $v \in \{0, 1\}^d$ there exists $h \in H$ such that $Q_h(x_i, b_i) < r_i$ if $v_i = 1$ and
$Q_h(x_i, b_i) > r_i$ if $v_i = 0$. Then

$$\mathrm{sign}(r_i - Q_h(x_i, b_i)) = \mathrm{sign}(r_i - (h(x_i) - b_i)^2) = \mathrm{sign}(\sqrt{r_i} - |h(x_i) - b_i|)$$

which equals $\mathrm{sign}(\sqrt{r_i} - h(x_i))$ if $b_i = 0$, and equals $\mathrm{sign}(h(x_i) - (1 - \sqrt{r_i})) = 1 - \mathrm{sign}((1 - \sqrt{r_i}) - h(x_i))$
if $b_i = 1$. It follows that $\{(x_i, |b_i - \sqrt{r_i}|)\}_{i=1}^{d}$ is shattered by H, and thus the combinatorial
dimension of $Q_H$ is at most QD(H). ■

Note  that  the  second  part  of  the  proof  of  this  theorem  relies  critically  on  the  fact  that,  in 
the p-concept model, instances are only {0, 1}-labeled.

4-5.1  Linear  function  spaces 

Armed with the definition of quadratic loss dimension and the sample size upper bounds provided
by Theorem 5.1, we can now seek efficient algorithms that work by directly minimizing
the quadratic loss over an infinite class of functions. This is the approach taken in our next
theorem. For any domain X, let $f_i : X \to \mathbb{R}$, $1 \le i \le d$, be any d functions, and let $C(f_1, \ldots, f_d)$
denote the class of all p-concepts of the form $c(x) = \sum_{i=1}^{d} a_i f_i(x)$ for $a_i \in \mathbb{R}$, where we assume
that the $f_i$ and $a_i$ are such that $c(x) \in [0, 1]$ for all $x \in X$. We describe below an algorithm
that learns a model of probability for p-concepts in the class $C(f_1, \ldots, f_d)$. The running time
of this algorithm is polynomial in the usual parameters, d, and the time needed to evaluate the
functions $f_i$.

This result can be applied to prove the polynomial learnability of several natural p-concept
classes. For instance, consider the generalization of deterministic disjunctions in which the
target p-concept has the form $c(x) = (x_{i_1} + \cdots + x_{i_k})/k$, where the $x_{i_j}$ are Boolean variables
chosen from $x_1, \ldots, x_n$ and + denotes ordinary addition. Thus, such a p-concept is “more
positive” on vectors $x \in \{0, 1\}^n$ that have many of the relevant variables set to 1. Such a
p-concept class is clearly of the form required by Theorem 5.3, and so is polynomially learnable
with a model of probability.

As a more subtle application, consider a class of p-concepts over $\{0, 1\}^n$ that are partially
specified by a canonical positive example $z \in \{0, 1\}^n$. We wish to model a setting in which
z is the prototypical positive instance, and those examples “most like” z are more likely to
be labeled positively. Thus, the target p-concept might have the form $c(x) = a - b \cdot d(x, z)$
where d(x, z) denotes the Hamming distance and a and b are positive real-valued coefficients
such that c is maximized at z and is always in the range [0, 1]. Here the p-concept class C is
obtained by ranging over the choices of the prototype z and the coefficients a and b, and the
“decay function,” which specifies the rate at which vectors further away from the prototype fail
to exemplify the concept, is linear. It is not difficult to show that each function in C can in fact
be written as a weighted linear sum of the variables $x_1, \ldots, x_n$, so C is polynomially learnable
with a model of probability.
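
To see why the Hamming-distance form is linear, note that for Boolean vectors $d(x, z) = \sum_i z_i + \sum_i (1 - 2z_i) x_i$, so $a - b \cdot d(x, z) = (a - b \sum_i z_i) + \sum_i b(2z_i - 1) x_i$. The short check below verifies this identity exhaustively for one small case. It is a sketch under our own assumptions: the values of `z`, `a` and `b` are arbitrary, and the offset `w0` is treated as the coefficient of an extra constant basis function equal to 1.

```python
from itertools import product

def hamming(x, z):
    """Number of coordinates in which x and z differ."""
    return sum(xi != zi for xi, zi in zip(x, z))

def linear_form(a, b, z):
    """Return (w0, w) such that a - b * d(x, z) == w0 + sum_i w[i] * x[i]."""
    w0 = a - b * sum(z)
    w = [b * (2 * zi - 1) for zi in z]
    return w0, w

z = (1, 0, 1, 1)      # prototype (arbitrary choice)
a, b = 0.9, 0.2       # chosen so that c(x) stays within [0, 1] when n = 4
w0, w = linear_form(a, b, z)

max_gap = 0.0
for x in product((0, 1), repeat=len(z)):
    direct = a - b * hamming(x, z)
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    max_gap = max(max_gap, abs(direct - linear))
print("largest discrepancy:", max_gap)
```

Coordinates where the prototype has a 1 receive weight +b and the others receive weight -b, which matches the intuition that agreeing with z raises the probability of a positive label.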

Finally, we remark that Theorem 5.3 can be applied to learn so-called “t-transform functions”
considered by Mansour [60].

Theorem 5.3 For any set of d known computable functions $f_i : X \to \mathbb{R}$, $1 \le i \le d$, the
class $C(f_1, \ldots, f_d)$ is learnable with a model of probability. Specifically, there exists a learning
algorithm for this class whose running time is polynomial in $1/\epsilon$, $1/\delta$, $1/\gamma$, d and the maximum
time needed to evaluate any of the functions $f_i$.

Proof: Given $\epsilon, \delta > 0$, our algorithm draws a sample of size $m = \lceil m(d, \epsilon/2, \delta) \rceil$ as given
by Theorem 5.1, and attempts to find the choice of coefficients $a_1, \ldots, a_d$ that minimizes the
quadratic loss over the sample. This can be done using a standard least-squares approximation.
For instance, this can be done directly by differentiating with respect to each unknown coefficient
$a_i$ the expression $\sum_{j=1}^{m} \left[ \left( \sum_{i=1}^{d} a_i f_i(x_j) \right) - b_j \right]^2$ (where $\{(x_j, b_j)\}_{j=1}^{m}$ is the labeled sample), and
setting the resulting partial derivative to 0. This yields a system of d linear equations in the d
variables $a_i$ that is of a special form and that can be solved using standard techniques. Cormen,
Leiserson and Rivest [19, Chapter 31] describe in detail how this can be done efficiently; see
also Duda and Hart [22].

Let $\hat{a}_1, \ldots, \hat{a}_d$ be the resulting solution, and let $h_0 = \sum_{i=1}^{d} \hat{a}_i f_i$. Note that $h_0$ may not be
bounded between 0 and 1, and so may not be in $C = C(f_1, \ldots, f_d)$. We show below how to
handle this difficulty.

For any real-valued function f, let clamp(f) denote the function obtained by “clamping” f
between 0 and 1; that is, $\mathrm{clamp}(f) = g \circ f$ where $g : \mathbb{R} \to \mathbb{R}$, and g(x) is defined to be 0 if
$x < 0$, x if $0 \le x \le 1$, and 1 if $x > 1$. Let $H = \{\mathrm{clamp}(\sum_{i=1}^{d} a_i f_i) : a_i \in \mathbb{R}\}$. Our algorithm
outputs the hypothesis $h = \mathrm{clamp}(h_0)$. Clearly h is in H, as is the target c.
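
The fitting and clamping steps can be illustrated with a small pure-Python sketch. It is not the specific routine of the cited references: it simply forms the normal equations obtained from the partial derivatives and solves them by Gaussian elimination, assuming the resulting system is nonsingular. The basis functions and the four-point sample are hypothetical.

```python
def solve(A, y):
    """Solve the square system A a = y by Gaussian elimination with
    partial pivoting (assumes A is nonsingular)."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    a = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * a[k] for k in range(i + 1, n))
        a[i] = (M[i][n] - tail) / M[i][i]
    return a

def fit_clamped(fs, sample):
    """Return least-squares coefficients for sum_i a_i f_i together with
    the clamped hypothesis h = clamp(h_0)."""
    d = len(fs)
    rows = [[f(x) for f in fs] for x, _ in sample]
    labels = [b for _, b in sample]
    # Normal equations (F^T F) a = F^T b, i.e. every partial derivative
    # of the empirical quadratic loss set to zero.
    A = [[sum(r[i] * r[k] for r in rows) for k in range(d)] for i in range(d)]
    y = [sum(r[i] * b for r, b in zip(rows, labels)) for i in range(d)]
    a = solve(A, y)

    def h(x):
        h0 = sum(ai * f(x) for ai, f in zip(a, fs))
        return min(1.0, max(0.0, h0))   # clamp into [0, 1]
    return a, h

fs = [lambda x: 1.0, lambda x: float(x)]       # hypothetical basis: 1 and x
sample = [(0, 0), (1, 0), (2, 1), (3, 1)]      # {0, 1}-labeled examples
coeffs, h = fit_clamped(fs, sample)
print(coeffs, [h(x) for x, _ in sample])
```

On this sample the unclamped fit is approximately -0.1 + 0.4x, which leaves [0, 1] at both ends of the sample; clamping pulls those two predictions back to 0 and 1, and since the labels are themselves 0 or 1 this cannot increase the empirical quadratic loss.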

Dudley  [23]  shows  that  a  d-dimensional  linear  function  space  has  combinatorial  dimension  d. 
(This  is  reproved  by  Haussler  [35],  Theorem  4.)  Combined  with  Haussler’s  Theorem  5  (which 
concerns  the  combinatorial  dimension  of  families  of  functions  constructed  in  the  same  way 
as H), this immediately implies $QD(H) \le d$. Thus, by Theorem 5.1 and our choice of m,
with probability at least $1 - \delta$, $|\hat{E}[Q_{h'}] - E[Q_{h'}]| \le \epsilon/2$ for every $h' \in H$. Also, note that
$\hat{E}[Q_h] \le \hat{E}[Q_{h_0}]$ since all instances in the sample are {0, 1}-labeled, so clamping the hypothesis
only improves its performance. Thus, with probability at least $1 - \delta$, we have:


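$$
\begin{aligned}
E_{x \in D}\left[(h(x) - c(x))^2\right] &= E[Q_h] - E[Q_c] \\
&\le \hat{E}[Q_h] - E[Q_c] + \epsilon/2 \\
&\le \hat{E}[Q_{h_0}] - E[Q_c] + \epsilon/2 \\
&\le \hat{E}[Q_c] - E[Q_c] + \epsilon/2 \le \epsilon.
\end{aligned}
$$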


As  usual,  Lemma  2.2  can  be  applied  to  convert  this  algorithm  into  one  that  learns  a  good 
model of probability for this class. ■


4-6  A  lower  bound  on  sample  size 

Theorem  5.1  provides  a  kind  of  general  upper  bound  on  the  sample  size  required  for  learning 
a  model  of  probability.  We  turn  now  to  the  problem  of  lower  bounds  on  sample  complexity  in 
this  framework.  For  this,  we  need  to  introduce  a  refined  notion  of  shattering. 

Let H be a class of p-concepts over domain X. Let $T = \{(x_1, r_1), \ldots, (x_d, r_d)\}$ be a set of
d pairs, where each $x_i \in X$ and each $r_i \in [0, 1]$. For $w > 0$, we say that H w-shatters T if for
every string $v \in \{0, 1\}^d$ there is a p-concept $h \in H$ such that for $1 \le i \le d$, if $v_i = 0$ then
$h(x_i) < r_i - w$ and if $v_i = 1$ then $h(x_i) > r_i + w$. Thus, in addition to T being shattered by
H we require that there be a separation of width w between $r_i$ and each $h(x_i)$; we call w the
width of shattering. Note that if H has quadratic loss dimension at least d then there always exists
some $w > 0$ such that some set of d pairs over $X \times [0, 1]$ is w-shattered.
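
For a finite class, the available width can be computed directly: for each above/below pattern take the hypothesis that stays furthest from the $r_i$, then take the worst pattern. The sketch below is illustrative; the function name `shattering_margin` and the example class (all functions from two points into {1/2 - 0.1, 1/2 + 0.1}) are our own assumptions. The class then w-shatters the pairs for every w strictly smaller than the returned margin.

```python
from itertools import product

def shattering_margin(hypotheses, pairs):
    """Largest margin such that every above/below pattern is realized by
    some hypothesis staying at least that far from each r_i; returns 0.0
    if some pattern is not realized at all."""
    best = {}
    for h in hypotheses:
        diffs = [h(x) - r for x, r in pairs]
        if any(diff == 0 for diff in diffs):
            continue
        pattern = tuple(1 if diff > 0 else 0 for diff in diffs)
        margin = min(abs(diff) for diff in diffs)
        best[pattern] = max(best.get(pattern, 0.0), margin)
    if len(best) < 2 ** len(pairs):
        return 0.0
    return min(best.values())

w = 0.1
tables = [dict(zip((0, 1), vals))
          for vals in product((0.5 - w, 0.5 + w), repeat=2)]
hypotheses = [table.get for table in tables]   # each maps a point to its value
pairs = [(0, 0.5), (1, 0.5)]
margin = shattering_margin(hypotheses, pairs)
print("margin:", margin)
```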

Based on this stronger notion of shattering, we can now prove the following lower bound
on  sample  complexity  in  our  model.  This  lower  bound,  combined  with  Theorems  5.1  and  5.2, 
shows  that  when  the  quadratic  loss  dimension  is  finite  it  characterizes  the  sample  size  required 
for  learning  with  a  model  of  probability  (that  is,  the  bound  obtained  by  applying  Theorem  5.1 
is tight within a polynomial factor of $1/\epsilon$ and $1/\delta$). This lower bound may also be of theoretical
interest,  since  in  Haussler’s  general  learning  framework  [36]  the  combinatorial  dimension  is  only 
an  approximate  upper  bound  on  the  so-called  covering  number,  which  is  directly  used  to  obtain 
sample-size  bounds. 

Theorem 6.1 Let C be a p-concept class that w-shatters a set of cardinality d. Then for
$\gamma < w$ and $\epsilon + \delta < 1/8$, any algorithm for learning C with a model of probability requires at least
$\lfloor d(\lg e)/8 \rfloor = \Omega(d)$ examples.

Proof:  Our  proof  is  based  on  the  analogous  lower  bound  proof  given  by  Blumer  et  al.  [14] 
for  learning  deterministic  concepts.  However,  the  analysis  is  more  involved  in  the  probabilistic 
case. 

Let $T = \{(x_1, r_1), \ldots, (x_d, r_d)\}$ be w-shattered by the p-concept class C. Let $C_0 \subseteq C$ be any
fixed subclass of C such that $C_0$ w-shatters T and $|C_0| = 2^d$. Let A be a learning algorithm for C
taking m examples for the given choices of $\epsilon$, $\delta$ and $\gamma$, and let $h_S$ denote the hypothesis output
by A on input a labeled sample S of size m. (We assume for simplicity that A is deterministic;
the proof is easily modified to handle randomized algorithms.)

Let the target distribution D be uniform over the points $x_1, \ldots, x_d$. We define a weak
error measure e(c, S) for target $c \in C_0$ and input sample S as follows: the error $e_i(c, S)$ at
$x_i$ is defined to be 0 if $c(x_i)$ and $h_S(x_i)$ are either both less than $r_i$, or both greater than $r_i$;
otherwise, $e_i(c, S) = 1$. Then e is just the average of the $e_i$’s:

$$e(c, S) = \frac{1}{d} \sum_{i=1}^{d} e_i(c, S).$$

Note that if $e(c, S) > \epsilon$, then $h_S$ cannot be an $(\epsilon, w)$-good model of probability for c since if
$c(x_i)$ and $h_S(x_i)$ are not “on the same side” of $r_i$, then they differ by more than w.

This definition allows us to examine the expectation

$$E_S[e(c, S)] = \sum_S \Pr[S|c] \cdot e(c, S)$$

which is taken over S drawn randomly according to D and labeled randomly according to c,
and $\Pr[S|c]$ is the conditional probability that S is generated by D and c.

We will also be interested in the expectation of e(c, S) when both $c \in C_0$ is generated
uniformly at random and S is generated according to the randomly chosen c and the target
distribution D:

$$E_{c,S}[e(c, S)] = \frac{1}{2^d} \sum_S \sum_{c \in C_0} \Pr[S|c] \cdot e(c, S).$$

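We wish to lower bound e(c, S) for most of the p-concepts in $C_0$. For any sample S, let
$C_S = \{c \in C_0 : e(c, S) < 1/4\}$. Then for $c \in C_0 - C_S$, $e(c, S) \ge 1/4$, and so we obtain the lower
bound

$$E_{c,S}[e(c, S)] \ge \frac{1}{4} \cdot \frac{1}{2^d} \sum_S \sum_{c \in C_0 - C_S} \Pr[S|c] = \frac{1}{4 \cdot 2^d} \left( \sum_S \sum_{c \in C_0} \Pr[S|c] - \sum_S \sum_{c \in C_S} \Pr[S|c] \right).$$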


Now

$$\sum_S \sum_{c \in C_0} \Pr[S|c] = \sum_{c \in C_0} \sum_S \Pr[S|c] = |C_0| = 2^d$$

since for any c, $\sum_S \Pr[S|c] = 1$. For the second term of the expectation, we will upper bound
the total number of possible samples S, the cardinality $|C_S|$, and the value of $\Pr[S|c]$. First,
the number of possible samples S is at most $(2d)^m$, since each of the d points may appear with
either label and the number of examples in S is m. To bound $|C_S|$, consider drawing a p-concept
c uniformly at random from the class $C_0$. By choice of $C_0$, the probability that $e(c, S) < 1/4$ is
bounded by the probability of fewer than d/4 heads occurring in d flips of a fair coin. Thus,
applying Chernoff bounds (Lemma 2-3.6), we conclude


$$|C_S| \le |C_0| \cdot e^{-d/8} = 2^{(1 - \alpha_0) d}$$

where $\alpha_0 = (\lg e)/8$. Finally, $\Pr[S|c] \le 1/d^m$ since if we ignore the labels on the points in S,
the probability of any particular set of m points being generated by the target distribution D
is at most $1/d^m$.

Piecing together these bounds, we may now write

$$E_{c,S}[e(c, S)] \ge \frac{1}{4 \cdot 2^d} \left( 2^d - (2d)^m \cdot 2^{(1 - \alpha_0) d} \cdot d^{-m} \right) = \frac{1}{4} \left( 1 - 2^{m - \alpha_0 d} \right).$$

Thus, if $m < \alpha_0 d - 1$ then $E_{c,S}[e(c, S)] > 1/8$. From this it follows that for some fixed $c_0 \in C_0$,
$E_S[e(c_0, S)] > 1/8$ where the expectation is taken over S drawn according to D and labeled
according to $c_0$. By assumption A learns with a model of probability. Thus, with probability
at least $1 - \delta$, a sample S is chosen such that $h_S$ is an $(\epsilon, w)$-good model of probability. As
noted above, in such a case, $e(c, S) \le \epsilon$. Thus, $E_S[e(c_0, S)] \le (1 - \delta)\epsilon + \delta \le \epsilon + \delta$. Therefore,
if $\epsilon + \delta < 1/8$ then m is at least $\lfloor \alpha_0 d \rfloor$, proving the theorem. ■

Any  theorem  giving  a  sample-size  lower  bound  must  incorporate  the  width  of  shattering;  for 
instance, ours holds only for $\gamma < w$. To see that this is necessary, note that the p-concept class
of all functions mapping X into $\{1/2 - w, 1/2 + w\}$ shatters all of X, but for $\gamma > w$ this class
can  be  learned  with  no  examples  with  the  hypothesis  h(x)  =  1/2.  A  more  natural  example  is 
provided  by  the  non-decreasing  functions  of  Section  4-3.1.  Here  the  quadratic  loss  dimension 
is  infinite,  but  we  have  an  efficient  learning  algorithm.  An  interesting  open  problem  is  to  give 
improved  general  upper  bounds  on  sample  size  that  incorporate  the  width  of  shattering. 

4-7  Occam’s  Razor  for  general  loss  functions 

In this section, we present a generalized form of Occam’s Razor [13] applicable to the minimization
of bounded loss functions, and in particular to learning p-concepts with a model of
probability  or  a  decision  rule.  Here  we  have  several  motivations:  first,  it  is  of  philosophical 
interest  to  investigate  the  most  general  conditions  under  which  learning  is  equivalent  to  some 
form  of  data  compression;  second,  as  in  the  Valiant  model,  we  hope  that  Occam’s  Razor  will 
help  isolate  and  simplify  the  probabilistic  analysis  of  learning  algorithms;  third,  Occam’s  Razor 
may  be  easier  to  apply  than  uniform-convergence  methods  in  the  case  that  the  combinatorial 
dimension  is  unknown  or  difficult  to  compute;  and  fourth,  Occam’s  Razor  may  give  better 
sample-size  bounds  than  direct  analyses. 

An Occam algorithm for hypothesis class H over parametrized domain X, with respect to a
loss function $L : Y \times Y_0 \to [0, 1]$, is a polynomial-time algorithm A that takes as input a labeled
sample $S \in (X_n \times Y_0)^m$, and outputs a hypothesis h with the properties that:

1. $\hat{E}[L_h] - \inf_{h' \in H} \hat{E}[L_{h'}] \le \tau = \tau(n, m) = n^a m^{-\alpha}$ for some constants $a > 0$ and $\alpha > 0$; and

2. h can be represented by a string over the finite alphabet {0, 1} of encoded length $\ell =
\ell(n, m) = n^b m^{\beta}$ for some constants $b > 0$ and $\beta < 1$.

Thus,  as  in  the  non-probabilistic  setting,  we  require  an  Occam  algorithm  to  perform  some 
kind  of  data  compression,  i.e.,  to  output  a  hypothesis  significantly  smaller  than  the  given  sample. 
Furthermore,  the  output  hypothesis  must  come  close  to  minimizing  the  empirical  loss  on  the 
sample  over  the  entire  hypothesis  space  H. 

Theorem 7.1 Let A be an Occam algorithm as described above. Let S be a labeled sample of
size m generated according to some target p-concept c. Let h be the result of running A on S.
Assume m is so large that $\tau \le \epsilon/4$ and $2(2^{\ell} + 1)e^{-\epsilon^2 m/8} \le \delta$. Then

$$\Pr\left[ E[L_h] - \inf_{h' \in H} E[L_{h'}] > \epsilon \right] \le \delta.$$

In particular, this will be the case if

$$m \ge \max\left\{ \left( \frac{4 n^a}{\epsilon} \right)^{1/\alpha}, \left( \frac{16 (\ln 2) n^b}{\epsilon^2} \right)^{1/(1-\beta)}, \frac{16 \ln(4/\delta)}{\epsilon^2} \right\}.$$


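Proof: The proof is analogous to that of Blumer et al. [13].

Let $H_A$ be the set of (at most) $2^{\ell}$ hypotheses which might potentially be output by A. Let
$h_* \in H$ be such that $E[L_{h_*}] \le \inf_{h' \in H} E[L_{h'}] + \epsilon/4$. Then, by Chernoff bounds (Lemma 2-3.6),
the probability that either $E[L_{h'}] > \hat{E}[L_{h'}] + \epsilon/4$ for any $h' \in H_A$, or that $\hat{E}[L_{h_*}] > E[L_{h_*}] + \epsilon/4$
is at most $2(2^{\ell} + 1)e^{-\epsilon^2 m/8} \le \delta$. So, with probability at least $1 - \delta$,

$$
\begin{aligned}
E[L_h] &\le \hat{E}[L_h] + \epsilon/4 \\
&\le \inf_{h' \in H} \hat{E}[L_{h'}] + \epsilon/2 \\
&\le \hat{E}[L_{h_*}] + \epsilon/2 \\
&\le E[L_{h_*}] + 3\epsilon/4 \\
&\le \inf_{h' \in H} E[L_{h'}] + \epsilon.
\end{aligned}
$$

We show next that the stated bound on m is sufficient. Clearly, from the first bound on
m, $\tau \le \epsilon/4$. Further, from the second bound, we have that $\ell = n^b m^{\beta} \le (\lg e)\epsilon^2 m/16$. Thus,
$2(2^{\ell} + 1)e^{-\epsilon^2 m/8} \le 4 \cdot 2^{\ell} e^{-\epsilon^2 m/8} \le 4 e^{-\epsilon^2 m/16} \le \delta$ by the last bound on m. ■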
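
As a sanity check on the sufficiency argument, the sketch below evaluates the three-term bound, m at least the maximum of $(4n^a/\epsilon)^{1/\alpha}$, $(16 (\ln 2) n^b/\epsilon^2)^{1/(1-\beta)}$ and $16\ln(4/\delta)/\epsilon^2$, and then tests the two hypotheses of the theorem at that sample size. The function names and the parameter values are illustrative assumptions of ours, and the failure probability is compared in logarithms to avoid overflow in $2^{\ell}$.

```python
import math

def occam_sample_size(n, eps, delta, a, alpha, b, beta):
    """Closed-form sufficient sample size: the maximum of the three bounds."""
    m1 = (4 * n ** a / eps) ** (1 / alpha)
    m2 = (16 * math.log(2) * n ** b / eps ** 2) ** (1 / (1 - beta))
    m3 = 16 * math.log(4 / delta) / eps ** 2
    return math.ceil(max(m1, m2, m3))

def conditions_hold(m, n, eps, delta, a, alpha, b, beta):
    """Check tau <= eps/4 and 2 (2^ell + 1) exp(-eps^2 m / 8) <= delta."""
    tau = n ** a * m ** (-alpha)
    ell = n ** b * m ** beta
    # ln(2 (2^ell + 1)) = ln 2 + ell ln 2 + ln(1 + 2^(-ell))
    log_failure = (math.log(2) + ell * math.log(2)
                   + math.log1p(math.exp(-ell * math.log(2)))
                   - eps ** 2 * m / 8)
    return tau <= eps / 4 and log_failure <= math.log(delta)

params = dict(n=10, eps=0.1, delta=0.05, a=1, alpha=1, b=1, beta=0.5)
m = occam_sample_size(**params)
print(m, conditions_hold(m, **params))
```

With these particular values the second term dominates, reflecting the cost of the hypothesis length growing like the square root of m.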

As  an  example,  Theorem  7.1  can  be  applied  to  the  problem  of  learning  p-concepts  with  k 
relevant  variables.  Essentially,  the  algorithm  given  in  Theorem  4.2  can  be  modified  so  that  a 
single  initial  sample  of  size  m  can  be  used  for  all  of  the  estimates  made  by  that  algorithm. 
Note  that  a  hypothesis  output  by  this  algorithm  can  be  represented  by  the  names  of  k  of  the 
variables, plus the probabilities for the $2^k$ equivalence classes. Each name requires lg n bits,
and moreover, each probability is a rational number (being an empirical probability estimate)
that requires only O(log m) bits; thus, the hypothesis has size $O(k \log n + 2^k \log m)$. Finally,
it  can  be  shown  that  the  hypothesis  has  the  minimum  empirical  loss  over  the  entire  class  of 
p-concepts  with  k  relevant  variables.  Thus,  Theorem  7.1  can  be  used  to  easily  determine  an 
appropriate  sample  size  for  this  algorithm. 

Note  that  Theorem  7.1  is  only  applicable  to  algorithms  which  output  hypotheses  over  a  finite 
alphabet.  However,  the  theorem  can  be  extended  to  apply  to  other  algorithms  in  a  manner 
similar to the approach taken by Littlestone and Warmuth [59] in the Valiant model. (See
also Section 2-6.2.) The basic idea is to allow the learning algorithm to output hypotheses
that can be represented over the alphabet $S \cup \{0, 1\}$, where S is the given sample. That is,
the  representation  of  the  hypothesis  may  include  individual  examples  from  the  sample  itself. 
For  example,  the  hypothesis  output  by  the  algorithm  for  learning  increasing  functions  with  a 
decision  rule  (Section  4-5)  can  be  represented  by  a  single  example  from  the  sample,  despite  the 
fact  that  this  hypothesis  would  require  an  infinite  number  of  bits  to  represent  over  a  fixed  finite 
alphabet.  Thus,  this  alternate  form  of  Occam’s  Razor  can  be  used  to  provide  a  good  sample-size 
bound.  Similarly,  the  algorithm  given  in  Theorem  3.1  (slightly  modified)  for  learning  increasing 
functions  with  a  model  of  probability  can  be  cast  in  this  light  as  an  Occam  algorithm. 


4-8  Conclusions  and  open  problems 

In this chapter, we have explored an extension of Valiant’s model that incorporates the uncertainty
inherent to many real-world learning problems. We have focused primarily on techniques
for  the  design  of  efficient  algorithms  in  this  model. 

Naturally,  we  would  like  to  find  efficient  algorithms  for  much  broader  classes  of  p-concepts 
than  the  simple  classes  considered  here.  For  example,  can  the  algorithm  of  Section  4-3.2  be 
extended  to  learn  arbitrary  (not  necessarily  w-converging)  probabilistic  decision  lists?  As  is 
often the case in the deterministic Valiant model, sample size is not the problem: from Theorem 7.1,
one can fairly easily derive a polynomial sample-size bound for learning this class using
a  computationally  inefficient  Occam  algorithm  that,  given  a  sample,  finds  the  decision  list  with 
the  minimum  quadratic  loss  by  trying  all  permutations  of  the  list  order.  The  problem  here  is 
computational:  how  can  we  learn  this  class  efficiently?  The  development  of  further  techniques 
for  learning  p-concepts  is  a  vitally  important  direction  for  further  research. 

Although  the  p-concept  model  captures  realistic  aspects  of  many  learning  problems,  it  might 
still  be  criticized  for  its  assumption  that  the  target  p-concept  belongs  to  an  a  priori  known  class 
of  p-concepts.  More  realistic  is  a  so-called  agnostic  learning  model  in  which  the  target  p-concept 
is any function from X into [0, 1], and the learner’s goal is to find the best hypothesis from some
fixed  space  of  hypotheses.  This  is  actually  the  framework  assumed  by  Haussler  [36]  in  deriving 
his  sample-size  bounds.  A  few  of  the  algorithms  described  in  this  chapter  are  effective  agnostic 
learners,  such  as  the  algorithm  of  Theorem  4.2  for  p-concepts  with  k  relevant  variables.  An 
important  open  problem  is  the  extension  of  other  algorithms  to  agnostic  learning.  For  instance, 
do there exist efficient agnostic algorithms for probabilistic decision lists with w-converging
probabilities  (Section  4-3.2),  or  for  linear  function  spaces  (Section  4-5.1)? 

It  is  also  important  to  continue  to  develop  a  theoretical  foundation  for  p-concept  learning. 
For  instance,  are  there  other  loss  functions  that  might  be  appropriate,  such  as  the  log  loss 
function? (See Haussler [37] in this regard.) Also, can the lower bound proof of Theorem 6.1
be significantly improved?

Finally,  consistent  with  our  quest  for  efficient  algorithms  is  the  need  to  be  able  to  recognize 
that  a  learning  problem  is  computationally  intractable.  Various  techniques  in  this  regard  have 
been  developed  in  the  Valiant  model,  such  as  those  of  Pitt  and  Valiant  [68],  and  Kearns  and 
Valiant  [52,  49].  Can  such  techniques  be  extended  to  the  p-concept  model?  Both  of  these 
results  seem  to  depend  crucially  on  the  deterministic  nature  of  the  Valiant  model.  What  then 
would  a  negative,  computational  result  look  like  in  the  p-concept  model? 


Chapter  5 


Inference of Finite Automata Using Homing Sequences


5-1  Introduction 

Imagine  a  simple,  autonomous  robot  placed  in  an  unfamiliar  environment.  Typically,  such  a 
robot  would  be  equipped  with  some  sensors  (a  camera,  sonar,  a  microphone,  etc.)  that  provide 
the  robot  limited  information  about  the  state  of  its  environment.  Being  autonomous,  the  robot 
would  also  have  some  simple  actions  that  it  has  the  option  of  executing  (step  ahead,  turn  left, 
lift arm, etc.).

For instance, the robot might be in the simple toy environment of Figure 1. In this environment,
the robot can sense its local environment (whether the “room” it occupies is shaded
or not), and can traverse one of the out-going edges by executing action “x” or action “y.”

A priori, the robot may not be aware of the “meaning” of its actions, nor of the sense data
it  is  receiving.  It  may  also  have  little  or  no  knowledge  beforehand  about  the  “structure”  of  its 
environment. 

This  problem  motivates  the  research  presented  in  this  chapter:  how  can  the  robot  infer  on 
its  own  from  experience  a  good  model  of  its  world?  Specifically,  such  a  model  should  explain 
and  predict  how  the  robot’s  actions  affect  the  sense  data  received. 

Certainly,  once  such  a  model  has  been  inferred,  the  robot  can  function  more  effectively  in 
the  learned  environment.  However,  programming  the  robot  with  a  complete  model  of  a  fairly 
complex  environment  would  be  prohibitively  difficult;  what’s  more,  even  if  feasible,  a  robot 
with  a  pre-programmed  world  model  is  entirely  lacking  in  flexibility  and  would  likely  have  a 
hard  time  coping  in  environments  other  than  the  one  for  which  it  was  programmed.  Thus,  the 
development of effective learning methods would both simplify the job of the programmer, and
make for a more versatile robot.

Figure 1: An example robot environment.

This  problem  of  learning  about  a  new  environment  from  experience  has  been  addressed 
by  a  number  of  researchers  using  a  variety  of  approaches:  Drescher  [21]  explores  learning 
in  quite  rich  environments  using  an  approach  based  on  Piaget’s  theories  of  early  childhood 
development.  Wilson  [87]  studies  so-called  genetic  algorithms  for  learning  by  “animats”  in 
unfamiliar environments. Kuipers and Byun [56] advocate a “qualitative” approach to the
related  problem  of  learning  a  map  of  a  mobile  robot's  environment.  The  map-learning  problem 
is  also  studied  by  Mataric  [61]. 

In  this  chapter,  we  take  an  initial  step  toward  a  general,  algorithmic  solution  to  the  robot’s 
learning  problem.  Specifically,  we  give  a  thorough  treatment  to  the  problem  of  inferring  the 
structure  of  an  environment  that  is  known  a  priori  to  be  deterministic  and  finite  state.  Such  an 
environment  can  be  naturally  modeled  as  a  deterministic  finite-state  automaton:  the  robot's 
actions  then  are  the  inputs  to  the  automaton,  and  the  automaton's  output  is  just  the  sense  data 
the  robot  receives  from  the  environment.  Our  goal  then  is  to  infer  the  unknown  automaton  by 
observing  its  input-output  behavior. 
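
The modeling choice described above can be made concrete with a small sketch. The three-room environment below is hypothetical (it is not the environment of Figure 1), and the class and method names are our own; the point is only that actions act as inputs, sense data as outputs, and the learner never sees the state or gets to reset it.

```python
class Environment:
    """A deterministic finite-state environment: the robot's actions are
    the inputs and the sense data are the outputs."""

    def __init__(self, transitions, outputs, state):
        self.transitions = transitions  # (state, action) -> next state
        self.outputs = outputs          # state -> sense datum
        self.state = state              # current state, hidden from the robot

    def sense(self):
        return self.outputs[self.state]

    def act(self, action):
        """Execute one action and return the sense datum observed next."""
        self.state = self.transitions[(self.state, action)]
        return self.sense()

    def run(self, actions):
        """Execute a sequence of actions, with no reset, and return the
        sequence of sense data observed along the way."""
        return [self.act(action) for action in actions]

# Hypothetical three-room world with actions "x" and "y".
transitions = {(0, "x"): 1, (0, "y"): 0,
               (1, "x"): 2, (1, "y"): 0,
               (2, "x"): 2, (2, "y"): 1}
outputs = {0: "shaded", 1: "unshaded", 2: "unshaded"}

robot_world = Environment(transitions, outputs, state=0)
print(robot_world.run("xxy"))
```

Because `run` continues from wherever the previous experiment left the robot, two identical action sequences can produce different observations, which is exactly the difficulty that arises when no reset is available.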

This problem has been well studied by the theoretical community, and it continues to
generate new interest. (See Pitt’s paper [67] for an excellent survey.) Virtually all previous
research,  however,  has  assumed  that  the  learner  has  a  means  of  “resetting”  the  automaton  to 
some  start  state.  Such  an  assumption  is  quite  unnatural,  given  our  motivation;  as  in  real  life, 
we  expect  the  robot  to  learn  about  its  environment  in  one  continuous  experiment.  The  main 
result  of  this  chapter  is  the  first  set  of  provably  effective  algorithms  for  inferring  finite-state 
automata  in  the  absence  of  a  reset. 

Here  is  a  brief  history  of  some  of  the  previous,  theoretical  work  on  inference  of  automata.  The 
most  important  lesson  of  this  research  has  been  that  a  combination  of  active  experimentation 
and  passive  observation  is  both  necessary  and  sufficient  to  learn  an  unknown  automaton. 

Angluin  [3]  and  Gold  [29]  show  that  it  is  NP-complete  to  find  the  smallest  automaton 
consistent  with  a  given  sample  of  input-output  pairs.  Pitt  and  Warmuth  [69]  show  that  merely 
finding an approximate solution is intractable (assuming P ≠ NP). In the Valiant model,
Kearns  and  Valiant  [52]  consider  the  problem  of  predicting  the  output  of  the  automaton  on 
a  randomly  chosen  input,  based  on  a  random  sample  of  the  machine’s  behavior.  Extending 
the  work  of  Pitt  and  Warmuth  [70],  they  show  that  this  problem  is  intractable,  assuming  the 
security  of  various  cryptographic  schemes.  Thus,  learning  by  passively  observing  the  behavior 
of  the  unknown  machine  is  apparently  infeasible. 

What  about  learning  by  actively  experimenting  with  it?  Angluin  [5]  shows  that  this  prob¬ 
lem  is  also  hard.  She  describes  a  family  of  automata  which  cannot  be  identified  in  less  than 
exponential  time  when  the  learner  can  only  observe  the  behavior  of  the  machine  on  inputs  of 
the  learner’s  own  choosing.  The  difficulty  here  is  in  accessing  certain  hard-to-reach  states. 

In spite of these negative results, Angluin [6], elaborating on Gold’s results [28], shows
that  a  combination  of  active  and  passive  learning  is  feasible.  Her  inference  procedure  is  able  to 
experiment  with  the  unknown  automaton,  and  is  given,  in  response  to  each  incorrect  conjecture 
of  the  automaton’s  identity,  a  counterexample,  a  string  that  is  misclassified  by  the  conjectured 
automaton.  Her  algorithm  exactly  identifies  the  unknown  automaton  in  time  polynomial  in  the 
automaton’s  size  and  the  length  of  the  longest  counterexample. 

As  mentioned  above,  a  serious  limitation  of  Angluin’s  procedure  is  its  critical  dependence 
on  a  means  of  resetting  the  automaton  to  a  fixed  start  state.  Thus,  the  learner  can  never 
really  “get  lost”  or  lose  track  of  its  current  state  since  it  can  always  reset  the  machine  to  its 
start  state.  In  this  chapter,  we  extend  Angluin’s  algorithm,  demonstrating  that  an  unknown 
automaton  can  be  inferred  even  when  the  learner  is  not  provided  with  a  reset. 

This  chapter  also  includes  an  improved  version  of  Angluin’s  algorithm  in  the  case  that  a 
reset  is  available;  this  improved  algorithm  significantly  reduces  the  number  of  experiments  that 
must  be  performed  by  the  learner. 

The generality of our results allows us to handle any “directed-graph environment,” such
as the one in Figure 1. This means that we can handle many special cases as well, such as
undirected  graphs,  planar  graphs,  and  environments  with  special  spatial  relations.  However, 
our  procedures  do  not  take  advantage  of  such  special  properties  of  these  environments,  some  of 
which  could  probably  be  handled  more  effectively.  For  example,  we  have  found  that  permutation 
automata are generally easier to handle than non-permutation automata.

Previously, Rivest and Schapire [73, 75, 79] introduced the “diversity-based” representation
of  finite  automata,  an  egocentric  and  often  quite  compact  representation.  They  also  described 
an  algorithm  that  was  proved  to  be  effective  for  permutation  automata,  even  in  the  absence  of 
a  reset.  Some  general  techniques  for  handling  non-permutation  automata  were  also  discussed; 
although  not  provably  effective,  these  seemed  to  work  well  in  practice  for  a  variety  of  simple 
environments. 

In  this  chapter,  we  generalize  these  results,  demonstrating  probabilistic  inference  procedures 
which are provably effective for both permutation and non-permutation automata. More generally,
we present new inference procedures for the usual global state representation, as well as
for  the  diversity-based  representation. 

Like Angluin, we assume that the inference procedures have an unspecified source of counterexamples
to incorrectly conjectured models of the automaton. This differs from Rivest and
Schapire’s previous work where the learning model incorporated no such source of counterexamples;
as already mentioned, this limitation makes learning of finite automata infeasible in
the general case. For a robot trying to infer the structure of its environment, a counterexample
is discovered whenever the robot’s current model makes an incorrect prediction. For the
special  class  of  permutation  automata,  we  show  that  an  artificial  source  of  counterexamples  is 
unnecessary. 

Our algorithms use powerful new techniques based on the inference of homing sequences. Informally,
a homing sequence is a sequence of inputs that, when fed to the machine, is guaranteed
to  “orient”  the  learner:  the  outputs  produced  for  the  homing  sequence  completely  determine 
the  state  reached  by  the  automaton  at  the  end  of  the  homing  sequence.  Every  finite-state 
machine  has  a  homing  sequence.  For  each  inference  problem,  we  show  how  a  homing  sequence 
can  be  used  to  infer  the  unknown  machine,  and  how  a  homing  sequence  can  be  inferred  as  part 
of  the  overall  inference  procedure. 

In sum, the main results of this chapter are four-fold: We describe efficient algorithms for
inference of general finite automata using both the state-based and the diversity-based representations;
both of these algorithms require a means of experimenting with the automaton and
a source of counterexamples. Then, for permutation automata, we give efficient algorithms for
both representations that do not require an external source of counterexamples. The time of the
diversity-based algorithm for permutation automata beats the best previous bound by roughly
a factor of $D^3/\log D$, where $D$ is the size of the automaton using the diversity-based representation.
In the other three cases, our procedures are the first provably effective polynomial-time
algorithms.


5-2  Two  representations  of  finite  automata 

5-2.1  The  global  state-space  or  standard  representation 

An environment or finite-state automaton $\mathcal{E}$ is a tuple $(Q, B, \delta, q_0, \gamma)$ where:

• $Q$ is a finite nonempty set of states,

• $B$ is a finite nonempty set of input symbols or basic actions,

• $\delta$ is the next-state or transition function, which maps $Q \times B$ into $Q$,

• $q_0$, a member of $Q$, is the initial state, and

• $\gamma$ is the output function, which maps $Q$ into $\{0, 1\}$.

This  is  the  standard,  or  state-based,  representation. 

For  example,  the  graph  of  Figure  1  depicts  the  global  state  representation  of  an  automaton 
whose  states  are  the  vertices  of  the  graph,  whose  transition  function  is  given  by  the  edges,  and 
whose  output  function  is  given  by  the  shading  of  the  vertices. 

We denote the set of all finitely long action sequences by $A = B^*$, and we extend the
domain of the function $\delta(q, \cdot)$ to $A$ in the usual way: $\delta(q, \lambda) = q$, and $\delta(q, ab) = \delta(\delta(q, a), b)$ for
all $q \in Q$, $a \in A$, $b \in B$. Here, $\lambda$ denotes the empty or null string. Thus, $\delta(q, a)$ denotes the
state reached by executing sequence $a$ from state $q$; for shorthand, we often write $qa$ to denote
this state.

We say that $\mathcal{E}$ is a permutation automaton if for every action $b$, the function $\delta(\cdot, b)$ is a
permutation of $Q$.

We refer to the sequence of outputs produced by executing a sequence of actions
$a = b_1 b_2 \cdots b_r$ from a state $q$ as the output of $a$ at $q$, denoted $q\langle a\rangle$:

$$q\langle a\rangle = \langle \gamma(q), \gamma(qb_1), \gamma(qb_1b_2), \ldots, \gamma(qb_1b_2\cdots b_r) \rangle.$$

For instance, if the robot in Figure 1 executes action $a = xy$ from its current state $q = 3$, then
it will observe the sequence of outputs

$q\langle a\rangle = 3\langle xy\rangle =$ □ ® •

(Don’t confuse $\gamma(qa)$ and $q\langle a\rangle$. The former is a single value, the output of the state reached
by executing $a$ from $q$; for instance, $\gamma(qa) =$ CD in the example above. In contrast, $q\langle a\rangle$ is a
$(|a| + 1)$-tuple consisting of the sequence of outputs produced by executing $a$ from state $q$.)
Finally, for $a \in A$, we denote by $Q\langle a\rangle$ the set of possible outputs on input $a$:

$$Q\langle a\rangle = \{ q\langle a\rangle : q \in Q \}.$$

Clearly, $|Q\langle a\rangle| \le |Q|$ for any $a$.

Action sequence $a$ is said to distinguish two states $q_1$ and $q_2$ if $q_1\langle a\rangle \ne q_2\langle a\rangle$. For instance,
$xy$ distinguishes states 3 and 4 of the environment of Figure 1, but not states 1 and 2. We
assume that $\mathcal{E}$ is reduced in the sense that, for every pair of distinct states, there is some action
sequence which distinguishes them.
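The definitions of this subsection can be made concrete with a small sketch. The three-state environment below is a hypothetical example chosen for illustration (it is not the environment of Figure 1), actions are single characters, and the function names are assumptions of this sketch rather than notation from the text.

```python
from itertools import product

# Hypothetical 3-state environment (NOT the one in Figure 1).
DELTA = {(0, "x"): 1, (0, "y"): 0,
         (1, "x"): 2, (1, "y"): 0,
         (2, "x"): 2, (2, "y"): 1}      # transition function delta
GAMMA = {0: 0, 1: 0, 2: 1}              # output function gamma
STATES = [0, 1, 2]
ACTIONS = ["x", "y"]

def run(q, a):
    """Extended transition function: the state qa reached by executing a from q."""
    for b in a:
        q = DELTA[(q, b)]
    return q

def output_seq(q, a):
    """q<a>: the (|a| + 1)-tuple of outputs observed while executing a from q."""
    outs = [GAMMA[q]]
    for b in a:
        q = DELTA[(q, b)]
        outs.append(GAMMA[q])
    return tuple(outs)

def distinguishes(a, q1, q2):
    """True if action sequence a distinguishes states q1 and q2."""
    return output_seq(q1, a) != output_seq(q2, a)

def is_permutation_automaton():
    """True if delta(., b) is a permutation of the states for every action b."""
    return all(sorted(DELTA[(q, b)] for q in STATES) == STATES for b in ACTIONS)

def is_reduced():
    """True if every pair of distinct states is distinguished by some
    sequence of length at most n - 1."""
    n = len(STATES)
    seqs = ["".join(p) for r in range(n) for p in product(ACTIONS, repeat=r)]
    return all(any(distinguishes(a, q1, q2) for a in seqs)
               for q1 in STATES for q2 in STATES if q1 < q2)
```

In this example, `output_seq(0, "xy")` is the full output tuple, whereas `GAMMA[run(0, "xy")]` is only its last entry, which mirrors the distinction between the output of a sequence and the output of the state it reaches.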

5-2.2  The  diversity-based  representation 

In this section, we describe the second of our representations. See Rivest and Schapire’s papers
[73, 75, 79] for further background and detail. The representation is based on the notion
of  tests  and  test  equivalence. 

A  test  is  an  action  sequence.  (This  definition  differs  slightly  from  that  given  in  previous 
papers  where  the  automata  considered  had  multiple  outputs  (or  “sensations”)  at  each  state.) 
The value of a test $t$ at state $q$ is $\gamma(qt)$, the output of the state reached by executing $t$ from $q$.

Two tests $t_1$ and $t_2$ are equivalent, written $t_1 \equiv t_2$, if the tests have the same value at every
state. For instance, in the environment of Figure 1, tests $yxx$ and $xx$ are equivalent, as are tests
$yy$ and $\lambda$.

It’s easy to verify that $\equiv$ defines an equivalence relation on the set of tests. We write $[t]$
to denote the equivalence class of $t$, the set of tests equivalent to $t$. The value of $[t]$ at $q$ is well
defined as $\gamma(qt)$. The diversity of the environment, $D(\mathcal{E})$, is the number of equivalence classes
of the automaton: $D(\mathcal{E}) = |\{[t] : t \in A\}|$. It can be shown that $\lg(|Q|) \le D(\mathcal{E}) \le 2^{|Q|}$, so the
diversity of a finite automaton is always finite [73, 79].

The  equivalence  classes  can  be  viewed  as  state  variables  whose  values  entirely  describe  the 
state  of  the  environment.  This  is  true  because  two  states  are  equal  (in  a  reduced  sense)  if  and 
only  if  every  test  has  the  same  value  in  both  states. 

It is often convenient to arrange the equivalence classes in an update graph such as the one
in Figure 2 for the environment of Figure 1. Each vertex in the graph is an equivalence class, so
the size of the graph is $D(\mathcal{E})$. An edge labeled $b \in B$ is directed from vertex $[t_1]$ to $[t_2]$ if and
only if $t_1 \equiv bt_2$. Note that each vertex has exactly one in-going edge labeled with each of the
basic actions. This is because if $t_1 \equiv t_2$ then $bt_1 \equiv bt_2$.

Figure 2: The update graph for the environment of Figure 1.

We associate with each vertex $[t]$ the value of $t$ in the current state $q$. In the figure, we have
used shading to indicate the value of each vertex in the robot’s current state. The output of the
current state is given by vertex $[\lambda]$, so this is the only vertex whose value can be observed by the
robot. When an action $b$ is executed from $q$, each vertex $[t]$ is replaced by the old value of $[bt]$,
the vertex at the tail of $[t]$’s (unique) in-going $b$-edge. That is, in the new state $qb$, equivalence
class $[t]$ takes on the old value of $[bt]$ in the starting state $q$. This follows from the fact that
$\gamma((qb)t) = \gamma(q(bt))$. For instance, if action $y$ is executed in the environment of Figures 1 and 2,
then the value of $[\lambda]$ in the new state is §8 , the old value of $[y]$; the new value of $[yxy]$ is □ ,
the old value of $[xy]$.

Thus,  the  value  of  each  equivalence  class  in  the  state  reached  by  executing  any  action  can 
be  determined  easily  using  the  update  graph.  Thus,  the  update  graph  can  be  used  to  simulate 
the  environment. 
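As a rough illustration of how the equivalence classes and the update graph might be computed from a state-based description, the sketch below identifies each class with its vector of values over all states and grows the set of classes by prepending actions. The environment is a hypothetical three-state example (not the one in Figure 1), and the function name and data layout are assumptions of this sketch.

```python
from collections import deque

# Hypothetical 3-state environment (NOT the one in Figure 1).
DELTA = {(0, "x"): 1, (0, "y"): 0,
         (1, "x"): 2, (1, "y"): 0,
         (2, "x"): 2, (2, "y"): 1}
GAMMA = {0: 0, 1: 0, 2: 1}
STATES = [0, 1, 2]
ACTIONS = ["x", "y"]

def update_graph():
    """Return (classes, edges).

    classes maps the value vector of an equivalence class [t] (its value at
    every state) to a shortest representative test t.
    edges maps (vector of [t], b) to the vector of [bt], i.e. the vertex at
    the tail of the in-going b-edge of [t].
    """
    empty = tuple(GAMMA[q] for q in STATES)      # the class of the empty test
    classes = {empty: ""}
    edges = {}
    queue = deque([empty])
    while queue:
        vec = queue.popleft()
        for b in ACTIONS:
            # The value of bt at q equals the value of t at delta(q, b).
            new = tuple(vec[STATES.index(DELTA[(q, b)])] for q in STATES)
            edges[(vec, b)] = new
            if new not in classes:
                classes[new] = b + classes[vec]
                queue.append(new)
    return classes, edges

classes, edges = update_graph()
diversity = len(classes)    # D for this example
```

Two tests are equivalent exactly when their value vectors coincide, so the number of distinct vectors found is the diversity; the search terminates because prepending an action to equivalent tests yields equivalent tests.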


Simple-assignment  automata 

The update graph can be viewed more abstractly as a special kind of automaton: A simple-assignment
automaton $\mathcal{S}$ is a tuple $(V, B, T, v_0, \omega)$ where:

• $V$ is a finite nonempty set of variables,

• $B$ is a finite nonempty set of input symbols or basic actions,

• $T$ is the update function, which maps $V \times B$ into $V$,

• $v_0$, a member of $V$, is the output variable, and

• $\omega$ is the initial-value function which maps $V$ into $\{0, 1\}$.

Here, we interpret $V$ as a vector of state variables whose values determine the state of $\mathcal{S}$.
The initial values of these variables are given by $\omega$, and the output of the machine is the current
value of the special variable $v_0$. When an action $b \in B$ is executed, each variable $v$ is updated
in the new state with the old value of variable $T(v, b)$. The function $T$ can be extended to the
domain $V \times A$ in the usual manner by defining $T(v, \lambda) = v$ and $T(v, ba) = T(T(v, a), b)$ for
$v \in V$, $a \in A$ and $b \in B$. Thus, when $a$ is executed, variable $v$ is updated with the old value of
variable $T(v, a)$. In particular, this means that the output of $\mathcal{S}$ after executing $a \in A$ from the
initial state is $\omega(T(v_0, a))$.

Thus, the update graph is itself a simple-assignment automaton. In this case, the set
$V$ is the set of equivalence classes $\{[t] : t \in A\}$; the update function is defined by the rule
$T([t], b) = [bt]$; the output variable is $v_0 = [\lambda]$; and $\omega([t])$ is the value of $[t]$ in the initial state,
$\gamma(q_0 t)$. With these definitions, it is straightforward to verify that $T(v_0, t) = [t]$, and so

$$\omega(T(v_0, t)) = \gamma(\delta(q_0, t)) \quad \text{for all } t \in A.$$
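A minimal sketch of a simple-assignment automaton follows, assuming a hypothetical three-state environment (not the one in Figure 1) whose update graph has four classes, named here by the representative tests "", "x", "y" and "xx". The class name, the tables `UPDATE` and `OMEGA`, and the choice of initial state 0 are all assumptions made for illustration.

```python
# Hypothetical 3-state environment used only to cross-check the simulation.
DELTA = {(0, "x"): 1, (0, "y"): 0,
         (1, "x"): 2, (1, "y"): 0,
         (2, "x"): 2, (2, "y"): 1}
GAMMA = {0: 0, 1: 0, 2: 1}
ACTIONS = ["x", "y"]

# Its update graph, with variables named by representative tests:
# UPDATE[(t, b)] is the class of bt, and OMEGA[t] is the value of t at state 0.
UPDATE = {("", "x"): "x",   ("", "y"): "y",
          ("x", "x"): "xx", ("x", "y"): "",
          ("y", "x"): "y",  ("y", "y"): "y",
          ("xx", "x"): "xx", ("xx", "y"): "xx"}
OMEGA = {"": 0, "x": 0, "y": 0, "xx": 1}

class SimpleAssignmentAutomaton:
    def __init__(self, update, output_var, initial_values):
        self.update = update                  # update function on (variable, action)
        self.output_var = output_var          # the output variable v_0
        self.values = dict(initial_values)    # current value of every variable

    def step(self, b):
        """Execute action b: every variable v simultaneously takes the old
        value of the variable update[(v, b)]."""
        old = self.values
        self.values = {v: old[self.update[(v, b)]] for v in old}

    def output(self):
        return self.values[self.output_var]

def simulate(actions):
    """Output of the simple-assignment model after executing the given actions."""
    machine = SimpleAssignmentAutomaton(UPDATE, "", OMEGA)
    for b in actions:
        machine.step(b)
    return machine.output()
```

The variables never move; only their values are shuffled, and the output is always read from the same variable, which is the sense in which the representation is relative to the robot.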

At first blush, the structures of simple-assignment automata (such as the update graph of
Figure 2) and of ordinary finite-state automata (such as the one given by the transition diagram
of Figure 1) appear to be quite similar. In fact, their interpretations are very different. In the
global-state representation, the robot moves from state to state while the output values of
the states remain unchanged. On the other hand, in the diversity-based (or simple-assignment)
representation, the robot remains stationary, only observing the output of a single variable ($[\lambda]$),
and causing with its actions the values of the variables to move around. Thus, the diversity-based
representation is more egocentric: the world is represented relative to the robot. In
contrast, in the state-based representation, the world is represented by its global structure.

5-3  Homing  sequences 


Henceforth, we set $D = D(\mathcal{E})$, $n = |Q|$, $k = |B|$.

Input: $\mathcal{E}$ - a finite-state automaton
Output: $h$ - a homing sequence
Procedure:

    $h \leftarrow \lambda$
    while $q_1\langle h\rangle = q_2\langle h\rangle$ but $q_1h \ne q_2h$ for some $q_1, q_2 \in Q$ do
        let $x \in A$ distinguish $q_1h$ and $q_2h$
        $h \leftarrow hx$
    end

Figure  3:  A  state-based  algorithm  for  constructing  a  homing  sequence. 

A homing sequence is an action sequence $h$ for which the state reached by executing $h$ is
uniquely determined by the output produced: thus, $h$ is a homing sequence if and only if

$$(\forall q_1 \in Q)(\forall q_2 \in Q)\quad q_1\langle h\rangle = q_2\langle h\rangle \Rightarrow q_1h = q_2h.$$

For example, the string consisting of the single action “x” is a homing sequence for the
environment of Figure 1. If $q\langle x\rangle =$ □ □ , then $qx = 3$; if $q\langle x\rangle =$ O ® , then $qx = 2$; and, if
$q\langle x\rangle =$ 0 S then $qx = 1$.

Kohavi  [55]  gives  a  complete  discussion  of  homing  sequences.  He  distinguishes  between 
preset  and  adaptive  homing  sequences.  Initially,  we  make  use  only  of  the  former  because  they 
are  simpler;  later,  we  show  that  our  inference  procedures  can  be  improved  using  adaptive  homing 
sequences. 

Given full knowledge of the structure of $\mathcal{E}$, it is easy to construct a homing sequence $h$, as
shown in Figure 3. Initially, $h = \lambda$. On each iteration of the loop, a new extension $x$ is appended
to the end of $h$ so that $h$ now distinguishes two states not previously distinguished. Thus,
$|Q\langle h\rangle| < |Q\langle hx\rangle| \le n$, and therefore the program will terminate after at most $n - 1$ iterations.
Further, since each extension need only have length $n - 1$ (see, for instance, Kohavi [55],
Theorem 10-2), we have shown how to construct a homing sequence of length at most $(n - 1)^2$.
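The construction of Figure 3 can be sketched in a few lines when the automaton is fully known. The environment below is a hypothetical three-state example (not the one in Figure 1), distinguishing sequences are found by brute-force search over sequences of length at most $n - 1$, and all function names are assumptions of this sketch.

```python
from itertools import product

# Hypothetical 3-state environment (NOT the one in Figure 1).
DELTA = {(0, "x"): 1, (0, "y"): 0,
         (1, "x"): 2, (1, "y"): 0,
         (2, "x"): 2, (2, "y"): 1}
GAMMA = {0: 0, 1: 0, 2: 1}
STATES = [0, 1, 2]
ACTIONS = ["x", "y"]

def run(q, a):
    for b in a:
        q = DELTA[(q, b)]
    return q

def output_seq(q, a):
    outs = [GAMMA[q]]
    for b in a:
        q = DELTA[(q, b)]
        outs.append(GAMMA[q])
    return tuple(outs)

def distinguishing_sequence(q1, q2):
    """A shortest action sequence distinguishing q1 and q2 (brute force)."""
    for r in range(len(STATES)):                 # lengths 0 .. n - 1
        for p in product(ACTIONS, repeat=r):
            a = "".join(p)
            if output_seq(q1, a) != output_seq(q2, a):
                return a
    raise ValueError("states not distinguishable: automaton is not reduced")

def is_homing(h):
    """True if the output of h determines the state reached by h."""
    final = {}
    for q in STATES:
        if final.setdefault(output_seq(q, h), run(q, h)) != run(q, h):
            return False
    return True

def homing_sequence():
    """Figure 3: extend h until no two states agree on output yet differ in final state."""
    h = ""
    while True:
        bad = [(q1, q2) for q1 in STATES for q2 in STATES
               if q1 < q2 and output_seq(q1, h) == output_seq(q2, h)
               and run(q1, h) != run(q2, h)]
        if not bad:
            return h
        q1, q2 = bad[0]
        h += distinguishing_sequence(run(q1, h), run(q2, h))
```

For this example a single extension suffices: the empty sequence fails because states 0 and 1 have the same output, and appending a sequence that distinguishes them already yields a homing sequence.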

A diversity-based homing sequence is an action sequence $h$ which has the property that for
every test $t$, there exists a prefix $p$ of $h$ such that $p \equiv ht$. For instance, it can be shown that
$h = xxyx$ is a diversity-based homing sequence for the environment represented in Figures 1
and 2. For example, if $t = yxy$ then $ht = xxyxyxy \equiv xxy$.

Every diversity-based homing sequence $h$ is a homing sequence. For if $q_1h \ne q_2h$ then there
is some $t$ for which $\gamma(q_1ht) \ne \gamma(q_2ht)$. Since $ht$ is equivalent to some prefix $p$ of $h$, we have
$\gamma(q_1p) \ne \gamma(q_2p)$. Thus, $q_1\langle h\rangle \ne q_2\langle h\rangle$.

Figure 4 shows an algorithm for constructing a diversity-based homing sequence $h$. Again,
$h$ is built up from $\lambda$ by appending extensions $x$. On each iteration, the cardinality of the set
$\{[p] : p \text{ prefix of } h\}$ increases by at least one; since the cardinality of this set is clearly bounded
by $D$, there can be at most $D - 1$ iterations. Also, each extension need be no longer than $D - 1$.
(For if $|x| \ge D$, then $x$ has at least $D + 1$ suffixes, at least two of which must be equivalent.
Thus, for some $p, r, s$, $x = prs$, $rs \equiv s$ and $r \ne \lambda$; therefore, $ps$ is a shorter extension of $h$ than $x$
for which $hps$ is inequivalent to every prefix of $h$.) Thus, we can find a diversity-based homing
sequence of length at most $(D - 1)^2$.

Input: $\mathcal{E}$ - a finite-state automaton
Output: $h$ - a diversity-based homing sequence
Procedure:

    $h \leftarrow \lambda$
    while $(\exists x \in A)(\forall p \text{ prefix of } h)\; p \not\equiv hx$ do
        $h \leftarrow hx$
    end

Figure 4: A diversity-based algorithm for constructing a homing sequence.
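The diversity-based construction can likewise be sketched for a fully known automaton. As before, the three-state environment is hypothetical (not the one in Figure 1), test equivalence is decided by comparing value vectors over all states, and it is an assumption of this sketch that candidate extensions are drawn from one representative per equivalence class, which suffices because equivalent extensions $x$ give equivalent tests $hx$.

```python
from collections import deque

# Hypothetical 3-state environment (NOT the one in Figure 1).
DELTA = {(0, "x"): 1, (0, "y"): 0,
         (1, "x"): 2, (1, "y"): 0,
         (2, "x"): 2, (2, "y"): 1}
GAMMA = {0: 0, 1: 0, 2: 1}
STATES = [0, 1, 2]
ACTIONS = ["x", "y"]

def run(q, a):
    for b in a:
        q = DELTA[(q, b)]
    return q

def value_vector(t):
    """Value of test t at every state; equal vectors mean equivalent tests."""
    return tuple(GAMMA[run(q, t)] for q in STATES)

def class_representatives():
    """One shortest test per equivalence class, found by prepending actions."""
    reps = {value_vector(""): ""}
    queue = deque([""])
    while queue:
        t = queue.popleft()
        for b in ACTIONS:
            vec = value_vector(b + t)
            if vec not in reps:
                reps[vec] = b + t
                queue.append(b + t)
    return list(reps.values())

def diversity_homing_sequence():
    """Figure 4: extend h while some hx is inequivalent to every prefix of h."""
    reps = class_representatives()
    h = ""
    while True:
        prefix_classes = {value_vector(h[:i]) for i in range(len(h) + 1)}
        ext = next((x for x in reps
                    if value_vector(h + x) not in prefix_classes), None)
        if ext is None:
            return h
        h += ext
```

On this example the loop runs $D - 1 = 3$ times, adding one new prefix class per iteration, which matches the iteration bound argued above.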

Some  other  remarks  about  the  length  of  homing  sequences:  First,  the  homing  sequences 
constructed  by  the  preceding  algorithms  are  the  best  possible  in  the  sense  that  there  exist 
environments whose shortest homing sequence has length $\Omega(n^2)$ (or $\Omega(D^2)$). However, given
a  state-based  (or  a  diversity-based)  description  of  a  finite-state  machine,  it  is  NP-complete  to 
find  the  shortest  homing  sequence  for  the  automaton.  (This  can  be  shown,  for  instance,  by  a 
reduction  from  exact  3-set  cover.) 

5-4  A  state-based  algorithm  for  general  automata 

In  this  and  the  next  sections,  we  describe  general  algorithms  for  inferring  the  structure  of  an 
unknown environment $\mathcal{E}$.

We  say  that  the  learner  has  a  perfect  model  of  its  environment  if  it  can  predict  perfectly  the 
output  of  the  environment  given  any  sequence  of  actions.  The  goal  of  our  inference  procedures 
is  to  construct  a  perfect  model. 

We assume that the learner is given access to $\mathcal{E}$, that is, that the learner can observe the output
of the environment when actions of its choosing are executed. We also assume that there is a
“teacher” who provides the learner with counterexamples to incorrectly conjectured models of
the environment. A counterexample is a sequence of actions whose true output from the current
state differs from that predicted by the learner’s model. Typically, there will be many sequences
of actions which are counterexamples to a given conjecture, and by choosing an especially long
or short counterexample, the teacher can significantly affect the running time of the procedure.
This fact is reflected in our running times, which depend on the length of the counterexamples
provided.

In  the  framework  of  a  robot  learning  about  its  environment,  we  might  imagine  the  robot, 
upon  completion  of  a  model  of  the  environment  which  it  believes  to  be  correct,  using  that 
model  to  make  predictions  of  the  output  of  the  environment’s  next  state  until  an  incorrect 
prediction  is  made.  In  this  situation,  the  sequence  of  actions  leading  up  to  the  error  is  the 
needed  counterexample. 

We  generally  assume  that  the  unknown  automaton  is  strongly  connected,  that  is,  every  state 
can  be  reached  from  every  other  state: 

$$(\forall q_1 \in Q)(\forall q_2 \in Q)(\exists a \in A)(q_1a = q_2).$$

We make this assumption with little loss of generality: if $\mathcal{E}$ is not strongly connected, then an
experimenting  inference  procedure,  having  no  reset  operation,  will  sooner  or  later  fall  into  a 
strongly  connected  component  of  the  state  space  from  which  it  cannot  escape,  and  so  will  have 
to  be  content  thereafter  learning  only  about  that  component. 

This  section  focuses  on  an  algorithm  based  on  the  global  state  representation  for  inferring 
an  arbitrary  unknown  automaton. 

5-4.1 Angluin’s $L^*$ algorithm

Our procedure is based closely on Angluin’s $L^*$ algorithm for learning regular sets [6]. Angluin
shows  how  to  efficiently  infer  the  structure  of  any  finite-state  machine  in  the  presence  of  what 
she  calls  a  minimally  adequate  teacher.  Such  a  teacher  can  answer  two  kinds  of  queries:  On  a 
membership  query,  the  learner  asks  whether  a  given  input  string  w  is  in  the  unknown  language 
U,  that  is,  whether  the  string  is  accepted  by  the  unknown  machine.  On  an  equivalence  query, 
the  learner  conjectures  that  the  unknown  machine  is  isomorphic  to  one  it  has  constructed.  The 
teacher  replies  that  the  conjecture  is  either  correct  or  incorrect,  and  in  the  latter  case  provides 
a  counterexample  w,  a  string  accepted  by  one  machine  but  not  the  other. 

The idea of Angluin’s algorithm is to maintain an observation table $(S, E, T)$. Here, $S$ is a
prefix-closed set of strings, and $E$ is a suffix-closed set of strings. We can think of $S$ as a set
of strings that lead from the start state to the states of the automaton, and $E$ as experiments
which are executed from these states. The last variable $T$ is a two-dimensional table whose rows
are given by $S \cup SB$, and whose columns are given by $E$. Each entry $T(se)$, where $s \in S \cup SB$
and $e \in E$, records whether the string $se$ is in the unknown language. For fixed $s$, Angluin
denotes by $row(s)$ the vector of entries $T(se)$ for varying $e \in E$. Her algorithm extends $S$
and $E$ based on the results of queries, and ultimately outputs the correct automaton based on
an equivalence between the states of the unknown machine and the distinct rows of the table
$T$. We denote by $N_M$ and $N_E$ the number of membership and equivalence queries made by
$L^*$. These variables are implicit functions of $n$, $k$ and $m$, where $m$ is the length of the longest
counterexample received. For Angluin’s procedure $L^*$, we have $N_M = O(kmn^2)$, $N_E = n - 1$.
However, in Section 5-4.5 below, we show how $N_M$ can be improved to $O(kn^2 + n \log m)$.
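To make the observation table concrete, the sketch below fills in a table for fixed, hand-chosen sets $S$ and $E$ using membership queries answered with a reset. It is only an illustration of the data structure, not of $L^*$ itself: the three-state environment is hypothetical (not the one in Figure 1), a membership answer is taken to be the output of the state reached, and the function names are assumptions.

```python
# Hypothetical 3-state environment (NOT the one in Figure 1), start state 0.
DELTA = {(0, "x"): 1, (0, "y"): 0,
         (1, "x"): 2, (1, "y"): 0,
         (2, "x"): 2, (2, "y"): 1}
GAMMA = {0: 0, 1: 0, 2: 1}
ACTIONS = ["x", "y"]
Q0 = 0

def member(w):
    """Membership query with a reset: output of the state reached by w from Q0."""
    q = Q0
    for b in w:
        q = DELTA[(q, b)]
    return GAMMA[q]

def observation_table(S, E):
    """Entries T(se) for every row s in S or SB and every column e in E."""
    rows = list(S) + [s + b for s in S for b in ACTIONS if s + b not in S]
    return {(s, e): member(s + e) for s in rows for e in E}

def row(T, s, E):
    """The vector of entries T(se) for varying e in E."""
    return tuple(T[(s, e)] for e in E)

S = ["", "x", "xx"]     # prefix-closed access strings
E = ["", "x"]           # suffix-closed experiments
T = observation_table(S, E)
distinct_rows = {row(T, s, E) for s in S}
```

Here the three access strings lead to the three states and the two experiments separate them, so the distinct rows are in one-to-one correspondence with the states, which is the situation the algorithm works toward.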

In our framework, the learner could easily simulate Angluin’s algorithm $L^*$ if it were given
a reset: to perform a membership query on $w$, the learner resets the environment, and executes
the actions of $w$, observing the output of the last state reached. To perform an equivalence
query on $\mathcal{E}'$, the learner resets the automaton and conjectures that $\mathcal{E}'$ is a perfect model of the
environment. The teacher returns an action sequence $w$ on which the conjectured model fails;
this is the counterexample needed by $L^*$.

5-4.2  Using  a  homing  sequence  in  lieu  of  a  reset 

However,  in  our  model  the  learner  is  not  provided  with  a  reset.  The  main  idea  of  our  algorithm 
is  to  replace  the  reset  with  a  homing  sequence.  In  many  respects,  a  homing  sequence  behaves 
like  a  reset:  by  executing  the  homing  sequence,  the  learner  discovers  “where  it  is,”  what  state  it 
is  at  in  the  environment.  However,  unlike  a  reset,  the  final  state  is  not  fixed,  and  the  learner  does 
not  know  beforehand  what  state  it  will  end  up  in.  (Note  that  an  automaton  need  not  possess 
a  synchronizing  sequence ,  a  sequence  that  forces  the  automaton  into  a  given  state  independent 
of  its  starting  state.  So  we  use  homing  sequences  instead.) 

We  begin  by  supposing  that  the  learner  has  been  provided  with  a  correct  homing  sequence  h. 
Later,  we  will  show  how  to  remove  this  assumption. 

Suppose we execute $h$ from the current state $q$, producing output $\sigma = q\langle h\rangle$. If we ever
repeat this experiment from state $q'$ and find $q'\langle h\rangle = \sigma$, then, because $h$ is a homing sequence,
the states where we finished must have been the same in both cases: $qh = q'h$. If we could
guarantee that the output of $h$ would continue to come up $\sigma$ with good regularity, then we
could simply infer $\mathcal{E}$ by simulating Angluin’s algorithm, treating $qh$ as the initial state. When
$L^*$ demands a reset, we execute $h$: if the output comes up $\sigma$, then we must be at $qh$, and our
“reset” has succeeded; otherwise, try again. Unfortunately, in the general case, it may be very
difficult to make $h$ produce $\sigma$ regularly.

Instead, we simulate an independent copy $L^*_\sigma$ of $L^*$ for each possible output $\sigma$ of executing
$h$, as shown in Figure 5. Since $|Q\langle h\rangle| \le n$, no more than $n$ copies of $L^*$ will be created and
simulated. Furthermore, on each iteration of the loop, at least one copy makes one query and
so makes progress towards inference of $\mathcal{E}$. Thus, this algorithm will succeed in inferring $\mathcal{E}$ after
no more than $n(N_M + N_E)$ iterations.

Input: access to $\mathcal{E}$, a finite-state automaton
       $h$ - a homing sequence for $\mathcal{E}$
Output: a perfect model of $\mathcal{E}$
Procedure:

    repeat
        execute $h$, producing output $\sigma$
        if it does not already exist, create $L^*_\sigma$, a new copy of $L^*$
        simulate the next query of $L^*_\sigma$:
            if $L^*_\sigma$ queries the membership of action sequence $a$ then
                execute $a$ and supply $L^*_\sigma$ with the output of the final state reached
            if $L^*_\sigma$ makes an equivalence query then
                if the conjectured model $\mathcal{E}'$ is correct then
                    stop and output $\mathcal{E}'$
                else
                    obtain a counterexample and supply it to $L^*_\sigma$
    end

Figure 5: A state-based algorithm for inferring $\mathcal{E}$ given a correct homing sequence.
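The dispatching idea of Figure 5 can be illustrated without implementing $L^*$ in full. In the sketch below, each copy $L^*_\sigma$ is replaced by a plain table of membership answers (a deliberate simplification and an assumption of this sketch), the environment is a hypothetical three-state example (not the one in Figure 1), and the learner never resets: it only executes the homing sequence and files each answer under the output it just observed.

```python
import random

# Hypothetical 3-state environment (NOT the one in Figure 1).
DELTA = {(0, "x"): 1, (0, "y"): 0,
         (1, "x"): 2, (1, "y"): 0,
         (2, "x"): 2, (2, "y"): 1}
GAMMA = {0: 0, 1: 0, 2: 1}
STATES = [0, 1, 2]
ACTIONS = ["x", "y"]

def execute(q, a):
    """Execute a from q; return the state reached and the outputs observed."""
    outs = [GAMMA[q]]
    for b in a:
        q = DELTA[(q, b)]
        outs.append(GAMMA[q])
    return q, tuple(outs)

def per_output_tables(h, queries, steps, seed=0):
    """Answer membership queries without a reset, one table per output of h.

    tables[sigma][w] is the output of the state reached by w from the state
    in which h finished with output sigma; each table stands in for one copy
    of the learning algorithm.
    """
    rng = random.Random(seed)
    q = rng.choice(STATES)              # true state, hidden from the learner
    tables = {}
    for _ in range(steps):
        q, sigma = execute(q, h)        # plays the role of a reset
        table = tables.setdefault(sigma, {})
        pending = [w for w in queries if w not in table]
        if pending:
            q, outs = execute(q, pending[0])
            table[pending[0]] = outs[-1]
        else:
            # Nothing left to ask for this output; move on at random.
            q, _ = execute(q, rng.choice(ACTIONS))
    return tables
```

Because the sequence passed in is a true homing sequence for this example, every table is internally consistent: all of its answers were collected from the same starting state, even though the learner never knew which state that was.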

5-4.3  Constructing  a  homing  sequence 

We now describe how to combine construction of the homing sequence $h$ with the inference
of $\mathcal{E}$. We maintain throughout the algorithm a sequence $h$ which we presume is a true homing
sequence. When evidence arises indicating that this is not the case, we will see how $h$ can be
extended and improved, eventually leading to the construction of a correct homing sequence.
Initially, we take $h = \lambda$.

We use our presumably correct homing sequence $h$ as described above and in Figure 5. If $h$
is indeed a true homing sequence, we will of course succeed in inferring $\mathcal{E}$.

On the other hand, if $h$ is incorrect, we may discover inconsistent behavior in the course
of simulating some copy of $L^*$: suppose on two different iterations of the loop in Figure 5, we
begin in states $q_1$ and $q_2$, execute $h$, produce output $q_1\langle h\rangle = q_2\langle h\rangle = \sigma$, and, as part of the
simulation of $L^*_\sigma$, execute action sequence $x$. If $h$ were a homing sequence, then $x$'s output
would have to be the same on both iterations since $q_1h$ and $q_2h$ must be equal.

However, if $h$ is not a homing sequence, then it may happen that $q_1h\langle x\rangle \ne q_2h\langle x\rangle$. That is,
we have discovered that $x$ distinguishes $q_1h$ and $q_2h$, and so, just as was done in the algorithm
of Figure 3, we replace $h$ with $hx$, producing in a sense a “better” approximation to a homing
sequence. At this point, the existing copies of $L^*$ are discarded, and the algorithm begins from
scratch (except for resetting $h$, of course). Since $h$ can only be extended in this fashion $n - 1$
times, this only means a slowdown by at most a factor of $n$, compared to the algorithm of
Figure 5.
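The extension step can be seen in isolation in the following sketch, which drops $L^*$ entirely and uses randomly chosen experiments in place of its queries (an assumption made only to keep the example short). The environment is a hypothetical three-state example (not the one in Figure 1). Whenever the same experiment, run after the same output of $h$, produces two different results, $h$ is extended by that experiment and all recorded results are discarded.

```python
import random

# Hypothetical 3-state environment (NOT the one in Figure 1).
DELTA = {(0, "x"): 1, (0, "y"): 0,
         (1, "x"): 2, (1, "y"): 0,
         (2, "x"): 2, (2, "y"): 1}
GAMMA = {0: 0, 1: 0, 2: 1}
STATES = [0, 1, 2]
ACTIONS = ["x", "y"]

def execute(q, a):
    """Execute a from q; return the state reached and the outputs observed."""
    outs = [GAMMA[q]]
    for b in a:
        q = DELTA[(q, b)]
        outs.append(GAMMA[q])
    return q, tuple(outs)

def extend_on_inconsistency(steps, max_len=2, seed=0):
    """Start from the empty sequence and extend h whenever an inconsistency shows up.

    Returns the final h and the list of successive values h took after each
    extension. Random experiments stand in for the queries of the learner.
    """
    rng = random.Random(seed)
    q = rng.choice(STATES)              # true state, hidden from the learner
    h, memo, extensions = "", {}, []
    for _ in range(steps):
        q, sigma = execute(q, h)
        x = "".join(rng.choice(ACTIONS) for _ in range(rng.randint(1, max_len)))
        q, outs = execute(q, x)
        if memo.setdefault((sigma, x), outs) != outs:
            # Same output of h, same experiment, different result:
            # x distinguishes two states reached by h, so h is not homing.
            h += x
            memo.clear()                # results recorded under the old h are discarded
            extensions.append(h)
    return h, extensions
```

Each extension strictly increases the number of distinct outputs of $h$ over all starting states, which is why at most $n - 1$ extensions can ever occur; the sketch makes no claim about how quickly random experiments expose an inconsistency.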




Input: access to $\mathcal{E}$, a finite-state automaton
       $n$ - the number of states of $\mathcal{E}$
Output: a perfect model of $\mathcal{E}$
Procedure:

    $h \leftarrow \lambda$
    repeat
        execute $h$, producing output $\sigma$
        if it does not already exist, create $L^*_\sigma$, a new copy of $L^*$
        if $|\{row(s) : s \in S_\sigma\}| \le n$ then
            simulate the next query of $L^*_\sigma$ as in Figure 5 (and check for inconsistency)
        else
            let $\{s_1, \ldots, s_{n+1}\} \subseteq S_\sigma$ be such that $row(s_i) \ne row(s_j)$ for $i \ne j$
            randomly choose a pair $s_i, s_j$ from this set
            let $e \in E_\sigma$ be such that $T_\sigma(s_ie) \ne T_\sigma(s_je)$
            with equal probability, re-execute either $s_ie$ or $s_je$ (and check for inconsistency)
        if inconsistency found executing some string $x$ then
            discard all existing copies of $L^*$
            $h \leftarrow hx$
    until a correct conjecture is made

Figure 6: A state-based algorithm for inferring $\mathcal{E}$.

Figure  6  shows  how  we  have  implemented  these  ideas.  Here  we  have  assumed  n,  the  number 
of global states, has been provided to the learner. In fact, this assumption is entirely unnecessary.
Although we omit the details, we can show that the stated bounds below hold (up to a
constant)  for  a  slightly  modified  algorithm  which  does  not  require  that  the  learner  be  explicitly 
provided  with  the  value  of  n.  The  trick  is  the  usual  one  of  repeatedly  doubling  our  estimate 
of  n. 

Recall that $L^*$ requires maintenance of an observation table $(S, E, T)$. Let $(S_\sigma, E_\sigma, T_\sigma)$
denote the observation table of $L^*_\sigma$. Of course, $T_\sigma$ can only record output produced when
executing an action sequence from what is only presumed to be a fixed initial state.

Angluin’s analysis implies that if $L^*_\sigma$ makes more than $N_M + N_E$ queries, then the number
of distinct rows will exceed $n$. This can only happen if $h$ is not a homing sequence, but how
do we know how to correctly extend $h$ if we have not actually seen an inconsistency? We show
that if an inconsistency has not been found by the time the number of rows exceeds $n$, then we
can use a probabilistic strategy to find one quickly with high probability.

Suppose we execute $h$ from state $q$, with output $\sigma$, and we find that for $L^*_\sigma$, there are more
than $n$ distinct rows. Then, as in Figure 6, there exist strings $s_1, \ldots, s_{n+1}$ in $S_\sigma$ whose rows
are all distinct. By the pigeon-hole principle, there is at least one pair of distinct rows $s_i, s_j$
such that $qhs_i = qhs_j$. Further, since $row(s_i) \ne row(s_j)$, there is some $e \in E_\sigma$ for which
$T_\sigma(s_ie) \ne T_\sigma(s_je)$. However, $\gamma(qhs_ie) = \gamma(qhs_je)$. Therefore, either $\gamma(qhs_ie) \ne T_\sigma(s_ie)$ or
$\gamma(qhs_je) \ne T_\sigma(s_je)$, and so re-executing $s_ie$ (or $s_je$, respectively) from the current state $qh$ will
produce the desired inconsistency. (Recall that $T_\sigma$ records the results of previous executions of
these strings.)

So the chance of randomly choosing the correct pair $s_i, s_j$ as above is at least
$\binom{n+1}{2}^{-1}$, and the chance of then choosing the correct experiment to re-run of $s_ie$
or $s_je$ is at least $1/2$. Thus, the probability of finding an inconsistency using the technique
of Figure 6 in this situation is at least $1/(n(n+1))$. Repeating this technique
$n(n+1)\ln(1/\delta)$ times gives a probability of at least $1-\delta$ of finding an inconsistency.
Also, no more than $n^2$ copies of $L^*$ are ever created, and $|h|$ does not exceed
$O(n^2 + nm)$ since $h$ is extended at most $n-1$ times, and each extension has length
$O(n+m)$, a bound on the length of any query required by $L^*$.
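
The repetition count in the preceding paragraph can be checked numerically. The following is a minimal sketch, assuming independent trials that each succeed with probability exactly $1/(n(n+1))$ (the lower bound from the text); the function names are hypothetical and chosen only for illustration.

```python
import math


def per_trial_success(n: int) -> float:
    """Lower bound from the text on one trial's success probability: 1 / (n (n + 1))."""
    return 1.0 / (n * (n + 1))


def repetitions_needed(n: int, delta: float) -> int:
    """n (n + 1) ln(1 / delta), rounded up to a whole number of trials."""
    return math.ceil(n * (n + 1) * math.log(1.0 / delta))


def failure_probability(n: int, trials: int) -> float:
    """Chance that every one of `trials` independent attempts misses the inconsistency."""
    return (1.0 - per_trial_success(n)) ** trials


if __name__ == "__main__":
    n, delta = 5, 0.01
    r = repetitions_needed(n, delta)
    # The failure probability after r trials should not exceed delta.
    print(r, failure_probability(n, r) <= delta)
```

The check relies on the standard inequality $(1-p)^{\ln(1/\delta)/p} \le \delta$, which is why the number of repetitions scales with $n(n+1)$ and only logarithmically with $1/\delta$.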

Putting  these  facts  together,  we  have  proved: 

Theorem 4.1 Given $\delta > 0$, the algorithm described in Figure 6 halts and outputs a perfect
model with probability at least $1-\delta$ in time polynomial in $n$, $m$, $k$ and $1/\delta$, and
after executing

$$O\bigl(n^3(n+m)(n^2\log(n/\delta) + N_M + N_E)\bigr)$$

actions.

If we assume $m = O(n)$ and $k = O(1)$ and use the previously given bounds on $N_M$ and $N_E$,
then the number of actions executed by the procedure (and the running time as well) simplifies
to $O(n^6\log(n/\delta))$.

The  procedure  can  be  modified,  replacing  the  preset  homing  sequence  which  we  have  been 
using  with  an  adaptive  one  whose  input  at  each  step  depends  on  the  output  seen  up  to  that 
point.  This  modification  shaves  a  factor  of  n  off  the  bound  given  above,  and  is  described  in 
greater  detail  in  the  next  section. 

It is an open question whether this bound can be significantly tightened. It seems likely that
an algorithm which combines the many copies of $L^*$ into one would have a superior running
time, but we have not been successful in implementing this intuition.

5-4.4  Adaptive  homing  sequences 

The algorithm of Figure 6 is certainly quite wasteful in that, when $h$ is discovered not to be
a homing sequence, everything is thrown away and the algorithm starts over from scratch. As
a result, up to $n$ copies of $L^*$ are discarded each time $h$ is extended. Since $h$ can be
extended up to $n-1$ times, this means as many as $n^2$ copies of $L^*$ may eventually be
simulated by the algorithm.

Figure 7: An example adaptive homing sequence.

In this section, we describe a way of modifying the procedure so that only one copy of $L^*$
is discarded when $h$ is extended, leading to an $O(n)$ bound on the total number of copies of
$L^*$ simulated.

As mentioned above, the idea of the modification is to replace our preset homing sequence
(the kind described up to this point) with an adaptive one. In many ways, preset homing
sequences are rather inefficient tools. For example, it may be that, starting from some states,
executing only half the sequence is sufficient to reach a state uniquely determined by the
observed output. An adaptive homing sequence is a much more intelligent kind of homing
sequence. It is like a preset homing sequence in that the output observed can be used to
determine the state reached. However, the difference is that the action executed at each step
may depend on the output observed up to that point.

Despite its name, an adaptive sequence $a$ is not a sequence at all but a decision tree with
the following properties: The root node of $a$ is labeled $\lambda$, and each of the other nodes in
the tree is labeled with one of the basic actions in $B$. Every node has at most one 0-child and
at most one 1-child. An example adaptive sequence is given in Figure 7.

An adaptive sequence is executed in a natural manner: We begin at the root node. If the
output of the current state is 0 (or 1) then we proceed to the 0-child (1-child) of the root. The
basic action labeling the node reached is then executed, and, based on the resulting output, we
proceed down the tree in the same fashion, at each step branching to the 0- or 1-child depending
on the output observed. This continues until we “fall off” the tree, i.e., until it is necessary to
move to a node that does not exist in the tree.

For example, if the tree of Figure 7 is executed from the current state of Figure 1, then the
action sequence “x” will be executed, producing output [symbols illegible in source]; if the
sequence is executed from state 4, then “xy” will be executed, with output [symbols illegible in
source]. (An adaptive homing sequence can be naturally defined in terms of any set of output
symbols, such as the graphical symbols used in these figures, rather than 0 and 1.)

As with ordinary sequences, we write $qa$ to denote the state reached by executing adaptive
sequence $a$ from state $q$, and we write $q\langle a\rangle$ to denote the output produced by
executing $a$ from state $q$. Thus, as for preset homing sequences, an adaptive homing sequence
is an adaptive sequence $h$ for which $q_1\langle h\rangle = q_2\langle h\rangle$ only if
$q_1h = q_2h$ for all $q_1, q_2 \in Q$.
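
The execution rule and the homing condition above can be sketched in code. This is only an illustration: the `Node` class, the dictionary encoding of the transition function (`delta`) and output function (`gamma`), and the three-state toy environment are all assumptions introduced here, not constructions from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    """One node of an adaptive sequence; action is None at the root (label lambda)."""
    action: Optional[str] = None
    children: Dict[int, "Node"] = field(default_factory=dict)


def run_adaptive(root: Node, delta: dict, gamma: dict, q) -> Tuple[object, tuple]:
    """Execute the tree from state q; return (state reached, outputs observed)."""
    node, outputs = root, [gamma[q]]
    while outputs[-1] in node.children:      # stop when we "fall off" the tree
        node = node.children[outputs[-1]]    # branch on the output just observed
        q = delta[(q, node.action)]          # execute the action labeling that node
        outputs.append(gamma[q])
    return q, tuple(outputs)


def is_adaptive_homing(root: Node, delta: dict, gamma: dict, states) -> bool:
    """Check that equal outputs always imply equal final states."""
    final_state = {}
    for q in states:
        end, out = run_adaptive(root, delta, gamma, q)
        if final_state.setdefault(out, end) != end:
            return False
    return True


# Hypothetical toy environment: a 3-state cycle under action "x"; only state 2 outputs 1.
DELTA = {(0, "x"): 1, (1, "x"): 2, (2, "x"): 0}
GAMMA = {0: 0, 1: 0, 2: 1}

if __name__ == "__main__":
    root_only = Node()
    one_step = Node(children={0: Node("x")})
    print(is_adaptive_homing(root_only, DELTA, GAMMA, [0, 1, 2]))
    print(is_adaptive_homing(one_step, DELTA, GAMMA, [0, 1, 2]))
```

In the toy environment the root-only tree is not homing, because states 0 and 1 both output 0 and stay where they are, while the tree with a single 0-child labeled "x" is homing. Note that from state 2 no action is executed at all, which is the kind of saving a preset sequence cannot offer.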

Modifying  the  algorithm 

We are now ready to describe how the algorithm of Figure 6 can be modified to use adaptive
rather than preset homing sequences. The structure of the algorithm is not changed at all. Nor
is the simulation of queries, the handling of over-sized copies of $L^*$, etc. Only the construction
of the adaptive homing sequence $h$ is modified.

Initially, $h$ is chosen to be the adaptive sequence consisting just of a root node $\lambda$. As
before, $h$ is repeatedly executed; each time, its output selects a copy of $L^*$. A query of the
selected copy is then simulated, just as before. Now, however, a detected inconsistency is
handled differently: Suppose an inconsistency is found executing $x$. More precisely, suppose
that on two different iterations, we began in states $q_1$ and $q_2$, executed $h$, and observed
output $q_1\langle h\rangle = q_2\langle h\rangle = \sigma$. Further, when $x$ was then executed, it
was discovered that $q_1h\langle x\rangle \neq q_2h\langle x\rangle$. This latter fact implies that
$q_1h \neq q_2h$, and so $h$ cannot be an adaptive homing sequence. As before, we would like to
use $x$ to repair $h$: we would like to “graft” $x$ onto tree $h$ so that the resulting tree $h'$
distinguishes $q_1$ and $q_2$.

In fact, this can be done quite easily: Let $v_0$ be the last node of $h$ visited when $h$ is
executed from $q_1$ and $q_2$ (it must be the same node in both cases since
$q_1\langle h\rangle = q_2\langle h\rangle$), and let $x = b_1 \cdots b_r$, where $b_i \in B$. Note
that $v_0$ has no $\gamma(q_1h)$-child since this marks the “fall-off” point of $h$. The grafted tree
$h'$ is the same as $h$ except that in $h'$, node $v_0$ has a $\gamma(q_1h)$-child which is the root
of a linear subtree corresponding to the execution of $x$. More precisely, in $h'$, each node
$v_{i-1}$ has a $\gamma(q_1hb_1 \cdots b_{i-1})$-child $v_i$, labeled $b_i$, for $1 \le i \le r$.

It can be verified that $q_1\langle h'\rangle \neq q_2\langle h'\rangle$. Thus
$|Q\langle h\rangle| < |Q\langle h'\rangle| \le n$, and so $h$ will be grafted in this fashion at
most $n-1$ times.
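
The grafting step can be sketched as follows. This is a hedged illustration, not the thesis's own subroutine: the `Node` class, the dictionary encoding of the environment, and the function names `output` and `graft` are assumptions made for the example, and the tree is modified in place.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence


@dataclass
class Node:
    """One node of an adaptive sequence; action is None at the root (label lambda)."""
    action: Optional[str] = None
    children: Dict[int, "Node"] = field(default_factory=dict)


def output(root: Node, delta: dict, gamma: dict, q) -> tuple:
    """Outputs observed when the adaptive sequence is executed from state q."""
    node, outs = root, [gamma[q]]
    while outs[-1] in node.children:
        node = node.children[outs[-1]]
        q = delta[(q, node.action)]
        outs.append(gamma[q])
    return tuple(outs)


def graft(root: Node, delta: dict, gamma: dict, q1, x: Sequence[str]) -> None:
    """Attach the linear subtree for x = b_1 ... b_r below the fall-off node v_0."""
    node, q = root, q1
    while gamma[q] in node.children:         # walk h from q1 down to v_0
        node = node.children[gamma[q]]
        q = delta[(q, node.action)]
    for b in x:                              # here q is q1 h b_1 ... b_{i-1}
        child = Node(action=b)
        node.children[gamma[q]] = child      # the gamma(q)-child v_i, labeled b_i
        node, q = child, delta[(q, b)]


# Hypothetical toy environment: a 3-state cycle under action "x"; only state 2 outputs 1.
DELTA = {(0, "x"): 1, (1, "x"): 2, (2, "x"): 0}
GAMMA = {0: 0, 1: 0, 2: 1}

if __name__ == "__main__":
    h = Node()
    before = {output(h, DELTA, GAMMA, q) for q in (0, 1, 2)}
    graft(h, DELTA, GAMMA, 0, ["x"])
    after = {output(h, DELTA, GAMMA, q) for q in (0, 1, 2)}
    print(len(before), len(after))
```

In the toy run, states 0 and 1 produce the same output under the root-only tree, and executing "x" separates them; after grafting, the number of distinct outputs grows, mirroring the strict increase of $|Q\langle h\rangle|$ used to bound the number of grafts.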

So line 13 in Figure 6 is replaced by a call to a grafting subroutine as described above.
Further, since $h$ and $h'$ are the same except for node $v_0$, it is no longer necessary to discard
all copies of $L^*$; it is sufficient to discard only $L^*_\sigma$, the copy on which an inconsistency
was discovered. Thus, since $|Q\langle h\rangle|$ increases each time a single copy of $L^*$ is
discarded, at most $n-1$ copies are ever discarded throughout the execution of the algorithm.
Since the number of copies in existence at any one time is also bounded by $n$, it follows that
at most $2n-1$ copies of $L^*$ are simulated by this modified procedure. Thus, this improves the
bound given in Theorem 4.1 by a factor of $\Theta(n)$.




5-4.5 Improving Angluin’s L* algorithm

In this section, we describe a variant of Angluin’s $L^*$ algorithm that significantly improves the
worst-case number of membership queries made by the inference procedure. This, in turn, leads
to immediate improvements in the performance of our homing sequence algorithms.

As mentioned above, Angluin’s algorithm maintains an observation table $(S,E,T)$. The
function or table $T$ records the value $T(x) = \gamma(q_0x)$ for each string
$x \in (S \cup SB)E$. (Here, $q_0$ is $\mathcal{E}$’s initial state to which, in Angluin’s model, the
automaton can always be reset.) The entries of $T$ are filled in using membership queries, and it
follows that the number of queries needed is just the cardinality of $(S \cup SB)E$. For Angluin’s
algorithm, $|S|$ is bounded by $O(mn)$, and $|E|$ by $O(n)$. Our algorithm improves on Angluin’s
by limiting $|S|$ to just $n$; however, to achieve this bound on $|S|$, $n \lg m$ additional queries
will be needed, giving an overall bound of $O(kn^2 + n\log m)$ on the required number of
membership queries.

As mentioned earlier, $S$ is a prefix-closed set of strings representing states of $\mathcal{E}$.
Unlike Angluin’s algorithm, ours maintains the condition that $q_0s_1 \neq q_0s_2$ if
$s_1 \neq s_2$ for all $s_1, s_2 \in S$. Thus, $|S| \le n$ at all times. Also, $S$ only grows in size
(strings are never deleted from $S$). The set $E$ represents a set of experiments which distinguish
the states of $S$ (i.e., the states $q_0s$ for $s \in S$).

Here is an outline of our algorithm, which is very similar to Angluin’s. Initially, $S$ and
$E$ are initialized to the set $\{\lambda\}$. Using membership queries, fill in the entries of table
$T$, and make $(S,E,T)$ closed (discussed below). Then, from $(S,E,T)$, construct and conjecture
machine $\mathcal{E}'$. If the conjecture is correct, quit. Otherwise, update the set $E$ using the
returned counterexample, and repeat until a correct conjecture is made.

We say observation table $(S,E,T)$ is closed if for all $s \in SB$ there exists $s' \in S$ such
that $\mathit{row}(s) = \mathit{row}(s')$. (Recall that $\mathit{row}(s)$ is that function
$f : E \to \{0,1\}$ for which $f(e) = T(se)$.) If $s \in SB$ witnesses that $(S,E,T)$ is not closed,
then $s$ is simply added to $S$ (and $T$ updated using membership queries). Note that this
maintains the condition that all rows of $S$ are distinct (and thus, the states to which they lead
are also distinct).

(Angluin’s algorithm also requires that the observation table be consistent, that is, that
$\mathit{row}(s_1b) = \mathit{row}(s_2b)$ whenever $\mathit{row}(s_1) = \mathit{row}(s_2)$ for
$s_1, s_2 \in S$ and $b \in B$. However, since our algorithm maintains the condition that
$\mathit{row}(s_1) \neq \mathit{row}(s_2)$ for $s_1 \neq s_2$, this condition is always trivially
satisfied.)

The conjectured machine $\mathcal{E}' = (Q', B, \delta', q'_0, \gamma')$ is constructed in a natural
manner: its state set is $Q' = S$ with initial state $q'_0 = \lambda$; its output function is defined
by $\gamma'(s) = T(s)$; and its transition function is given by $\delta'(s,b) = s'$ where $s'$ is that
unique member of $S$ for which $\mathit{row}(sb) = \mathit{row}(s')$.
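
The closure step and the construction of the conjectured machine can be sketched as below. This is a minimal illustration under stated assumptions: strings are Python tuples of action names, the empty tuple stands for $\lambda$, `member` is a hypothetical membership oracle returning $\gamma(q_0w)$, and $E$ is assumed to contain $\lambda$ so that $T(s)$ is always available.

```python
from typing import Callable, Dict, List, Tuple

Word = Tuple[str, ...]


def row(s: Word, E: List[Word], T: Dict[Word, int]) -> Tuple[int, ...]:
    """row(s) as the tuple of table entries T(se) for e in E."""
    return tuple(T[s + e] for e in E)


def fill(S: List[Word], E: List[Word], T: Dict[Word, int],
         B: List[str], member: Callable[[Word], int]) -> None:
    """Fill in T on (S u SB)E, asking one membership query per missing entry."""
    for s in S + [s + (b,) for s in S for b in B]:
        for e in E:
            if s + e not in T:
                T[s + e] = member(s + e)


def close(S, E, T, B, member) -> None:
    """Add witnesses from SB to S until every row of SB already appears in S."""
    while True:
        fill(S, E, T, B, member)
        rows = {row(s, E, T) for s in S}
        witness = next((s + (b,) for s in S for b in B
                        if row(s + (b,), E, T) not in rows), None)
        if witness is None:
            return
        S.append(witness)                    # its row is new, so rows of S stay distinct


def conjecture(S, E, T, B):
    """Build (delta', gamma', q0') from a closed table whose S-rows are distinct."""
    by_row = {row(s, E, T): s for s in S}
    delta = {(s, b): by_row[row(s + (b,), E, T)] for s in S for b in B}
    gamma = {s: T[s] for s in S}             # T(s) = T(s lambda); needs lambda in E
    return delta, gamma, ()


if __name__ == "__main__":
    # Hypothetical target: a 3-state cycle under "x" whose third state outputs 1.
    member = lambda w: 1 if len(w) % 3 == 2 else 0
    S, E, T, B = [()], [(), ("x",)], {}, ["x"]
    close(S, E, T, B, member)
    print(S, conjecture(S, E, T, B)[1])
```

Because a witness is only added when its row differs from every row of $S$, the invariant that distinct members of $S$ lead to distinct states is preserved without any separate consistency check.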

Finally, if $\mathcal{E}'$ is different from $\mathcal{E}$, a counterexample $z$ is obtained, and
the set $E$ must be updated. Our algorithm adds only a single string to $E$ using $z$. However,
to find this string, the procedure makes up to $\lg|z|$ membership queries.

The key property that must be satisfied by the new experiment $e$ (which will be added to $E$)
is the following: for some $s, s' \in S$ and $b \in B$ for which
$\mathit{row}(s) = \mathit{row}(s'b)$, it must be that $\gamma(q_0se) \neq \gamma(q_0s'be)$. That
is, experiment $e$ must witness that $q_0s$ and $q_0s'b$ are different states. If this property is
satisfied, then adding $e$ to $E$ will cause $|S|$ to increase by at least one (to maintain closure)
so that the total number of equivalence queries is bounded by $n-1$.

We now describe how such an experiment can be found. For $0 \le i \le |z|$, let $p_i, r_i$ be
such that $z = p_ir_i$ and $|p_i| = i$. Let $s_i = \delta'(\lambda, p_i)$ be the state reached in
$\mathcal{E}'$ after the first $i$ symbols of $z$ have been executed. Then on input $z$, machine
$\mathcal{E}$ reaches a state outputting the value $\gamma(q_0z) = \gamma(q_0s_0r_0)$. (Assume this
value is 0.) On the other hand, on input $z$, machine $\mathcal{E}'$ reaches a state outputting the
value $\gamma'(\delta'(\lambda, z)) = \gamma'(s_{|z|}) = T(s_{|z|}) = \gamma(q_0s_{|z|}r_{|z|})$.
Since $z$ is a counterexample, this value must be 1.

Let $\alpha_i = \gamma(q_0s_ir_i)$. Note that, by simulating $\mathcal{E}'$, $s_i$ can be computed,
and so $\alpha_i$ can be determined with a membership query for any $i$. From the comments
above, we have that $\alpha_0 = 0$ and $\alpha_{|z|} = 1$. Using a kind of binary search, we can
find some $i$ such that $\alpha_i \neq \alpha_{i+1}$ (such an $i$ clearly must exist): first we query
$\alpha_{|z|/2}$; if the result is 1, then query $\alpha_{|z|/4}$; otherwise, query
$\alpha_{3|z|/4}$, etc. In this manner, such an $i$ can be found in $\lg|z|$ queries.

We claim then that $r_{i+1}$ is the desired experiment: Let $b$ be the first symbol of $r_i$. Then
$\gamma(q_0s_ibr_{i+1}) = \gamma(q_0s_ir_i) = \alpha_i \neq \alpha_{i+1} = \gamma(q_0s_{i+1}r_{i+1})$.
However, by definition of the states $s_i$ we have $s_{i+1} = \delta'(s_i, b)$ and so
$\mathit{row}(s_{i+1}) = \mathit{row}(s_ib)$. Thus, as argued above, adding $r_{i+1}$ to $E$ causes
$|S|$ to increase. It follows that at most $n-1$ equivalence queries are required by the algorithm.
For each equivalence query, $\lg m$ membership queries are needed to find the right experiment
to add to $E$. Also, since $|E| \le n$ and $|S| \le n$, at most
$|(S \cup SB)E| \le (k+1)n^2$ membership queries are needed to record the entries of $T$.
Finally, it can be seen that each membership query has length at most $n+m$. The procedure is
clearly polynomial time, and its correctness follows from arguments given above and by Angluin.
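
The binary search for the new experiment can be sketched as follows. This is a variant written under assumptions, not a transcription of the procedure above: it keeps the invariant that $\alpha$ differs at the two ends of the current interval rather than assuming $\alpha_0 = 0$, it spends one extra query on $\alpha_0$, and the names `find_experiment`, `delta_h`, and `member` are hypothetical.

```python
from typing import Callable, Dict, Tuple

Word = Tuple[str, ...]


def find_experiment(z: Word,
                    delta_h: Dict[Tuple[Word, str], Word],
                    member: Callable[[Word], int]) -> Word:
    """Return r_{i+1} for some i with alpha_i != alpha_{i+1}.

    z       : counterexample, as a tuple of basic actions
    delta_h : transition function of the conjectured machine (states are members of S)
    member  : membership oracle, member(w) = gamma(q0 w)
    """
    # s[i] is the conjectured machine's state after the first i symbols of z.
    s = [()]
    for b in z:
        s.append(delta_h[(s[-1], b)])

    def alpha(i: int) -> int:
        return member(s[i] + z[i:])          # gamma(q0 s_i r_i), one membership query

    lo, hi = 0, len(z)
    a_lo = alpha(lo)                         # invariant: alpha(lo) == a_lo != alpha(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if alpha(mid) == a_lo:
            lo = mid
        else:
            hi = mid
    return z[lo + 1:]                        # r_{i+1} with i = lo


if __name__ == "__main__":
    # Hypothetical target: 3-state cycle under "x", third state outputs 1.
    member = lambda w: 1 if len(w) % 3 == 2 else 0
    # Hypothetical one-state conjecture that always outputs 0.
    delta_h = {((), "x"): ()}
    print(find_experiment(("x", "x"), delta_h, member))
```

The invariant holds initially because $z$ is a counterexample, so $\alpha_0 \neq \alpha_{|z|}$, and each step halves the interval, giving roughly $\lg|z|$ membership queries.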

In a quite naive implementation of the algorithm, the rows of $S$ are filled in first, and, once
a row of some string in $SB$ has been completed, it is compared in $O(n^2)$ time to every other
row of $S$ until an identical row is discovered, or until it is determined that there is no other
identical row in $S$ (in which case, $(S,E,T)$ is not closed). For even such a naive
implementation, it can be verified that each query requires processing time that is at worst
proportional to the bound of $O(n^2 + nm)$ on the length of $h$ in the algorithm of Figure 6.

Combining this improvement to $L^*$ with the adaptive homing sequence ideas described in
Section 5-4.4, we thus have shown:


Theorem 4.2 There exists an algorithm that halts and outputs a perfect model of any
finite-state environment $\mathcal{E}$ with probability at least $1-\delta$. The algorithm’s running
time, and the number of actions executed are both bounded by

$$O\bigl(n^3(n+m)(n\log(n/\delta) + kn + \log m)\bigr).$$
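
As a check on how the bound of Theorem 4.2 arises, the following worked substitution starts from the bound of Theorem 4.1, removes one factor of $n$ (the saving attributed above to adaptive homing sequences), and plugs in the query counts of the improved $L^*$ variant. The step is a reconstruction from the quantities stated in this section, assuming $k \ge 1$ so that the additive $n$ from $N_E$ is absorbed.

```latex
\begin{align*}
& O\bigl(n^{2}(n+m)\,(n^{2}\log(n/\delta) + N_M + N_E)\bigr)
   && \text{(Theorem 4.1 with one factor of $n$ removed)} \\
&= O\bigl(n^{2}(n+m)\,(n^{2}\log(n/\delta) + kn^{2} + n\log m + n)\bigr)
   && \text{($N_M = O(kn^{2} + n\log m)$, $N_E \le n-1$)} \\
&= O\bigl(n^{3}(n+m)\,(n\log(n/\delta) + kn + \log m)\bigr).
\end{align*}
```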

5-5  A  diversity-based  algorithm  for  general  automata 

In this section, we describe a diversity-based algorithm for inferring finite automata in the
general case. The idea of the algorithm is to construct a simple-assignment automaton that is
equivalent to the update graph.

Our algorithm maintains a suffix-closed set $T$ of tests which will act as the variables of
the constructed simple-assignment automaton. A test $t$ is added to $T$ only after it has been
determined that $t$ is inequivalent to every test already in $T$. Thus, at all times, $|T| \le D$, and
each test of $T$ represents a different test-equivalence class (i.e., a node of the update graph);
naturally, we would like eventually that all of the equivalence classes be represented.

Additionally, a function or table $r : BT \to 2^T$ is maintained with the interpretation that
$r(x)$ represents those tests in $T$ which are plausibly equivalent to $x$. That is, initially
$r(x) = T$, and a test $t \in T$ is removed from $r(x)$ when it has been determined that
$t \not\equiv x$.

Note that if $|T| = D$ (so that every equivalence class is represented in $T$), and if, for
all $x \in BT$, $r(x)$ is a singleton $\{s_x\}$ for some $s_x \in T$, then $s_x \equiv x$ and a
simple-assignment automaton isomorphic to the update graph is easily constructed: its variable
set is $T$, its output variable is $\lambda$, and its update function $\Gamma$ is defined by
$\Gamma(t,b) = s_{bt}$. (The initial values function $\omega$ is handled below.)
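
To make the role of $\Gamma$ and $\omega$ concrete, here is a small simulation sketch. It assumes the usual semantics of a simple-assignment automaton, which is defined earlier in the thesis and not restated in this section: on basic action $b$, every variable $t$ simultaneously takes the old value of variable $\Gamma(t,b)$, and the machine outputs the value of the output variable. The dictionary encoding and the three-test example are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Test = Tuple[str, ...]


def step(values: Dict[Test, int], Gamma: Dict[Tuple[Test, str], Test], b: str) -> Dict[Test, int]:
    """Assumed update rule: each variable t takes the old value of Gamma(t, b)."""
    return {t: values[Gamma[(t, b)]] for t in values}


def run(omega: Dict[Test, int], Gamma: Dict[Tuple[Test, str], Test],
        actions: Sequence[str], out_var: Test = ()) -> List[int]:
    """Outputs of the automaton before any action and after each action."""
    values = dict(omega)
    outputs = [values[out_var]]
    for b in actions:
        values = step(values, Gamma, b)
        outputs.append(values[out_var])
    return outputs


# Hypothetical example: environment is a 3-state cycle under "x", third state outputs 1.
# Tests lambda, x, xx represent the three equivalence classes; Gamma(t, b) = s_{bt}.
GAMMA = {((), "x"): ("x",), (("x",), "x"): ("x", "x"), (("x", "x"), "x"): ()}
OMEGA = {(): 0, ("x",): 0, ("x", "x"): 1}    # values of the tests in the start state

if __name__ == "__main__":
    print(run(OMEGA, GAMMA, ["x"] * 4))
```

Under these assumptions, the value of test $t$ after executing $b$ equals the value of test $bt$ beforehand, which is exactly why $\Gamma(t,b)$ is taken to be the representative $s_{bt}$ of $bt$.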

The simple-assignment automata conjectured by our algorithm are constructed from $T$ and
$r$ in a very similar manner. We choose $V = T$ and $v_0 = \lambda$. However, in general, it may
not be the case that $|r(x)| = 1$ for all $x \in BT$. Therefore, we choose $\Gamma(t,b)$ to be an
arbitrary element of $r(bt)$. If one or more of our choices is incorrect, then we can use the
provided counterexample to correct our error. More precisely, we show that, using experiments
and counterexamples to this conjectured automaton, we can find $t \in T$ and $b \in B$ such that
$\Gamma(t,b) \not\equiv bt$. That is, our choice for $\Gamma(t,b)$ can be removed from $r(bt)$.
Thus, for some $x \in BT$, $r(x)$ is reduced in size.

Also, note that if $r(x)$ is reduced to the empty set, then $x$ is inequivalent to every member
of $T$, and so can itself be added to $T$. The table $r$ is then updated appropriately. Since
$|T| \le D$, since $r(x) \subseteq T$, and since some $r(x)$ shrinks on each iteration, it follows
that this simplified algorithm converges to a perfect model after at most $(k+1)D^2$ iterations.

5-5.1 An algorithm that uses a provided homing sequence

As  in  Section  5-4,  we  assume  initially  that  a  diversity-based  homing  sequence  h  is  given.  Later, 
we  show  how  h  can  be  constructed. 

Let $t$ be any test. Then $ht$ is equivalent to some prefix of $h$. For selected tests $t$ in $A$,
we maintain candidate sets $C(t) \subseteq \{0, \ldots, |h|\}$ representing the prefixes of $h$ which
are plausibly equivalent to $ht$. Let $h_i$ denote that prefix of $h$ of length $i$. Initially,
$C(t) = \{0, \ldots, |h|\}$, and, when it has been determined that $h_i \not\equiv ht$, index $i$ is
removed from the set. Note that when $ht$ is executed from some state $q$, both of the outputs
$\gamma(qh_i)$ and $\gamma(qht)$ are observed since $h_i$ is a prefix of $ht$. Thus, if we find
these outputs differ, then clearly $h_i \not\equiv ht$ and so $i$ can be deleted from $C(t)$.

Suppose $h$ has been executed from some state $q$ producing output
$\sigma = (\sigma_0, \ldots, \sigma_{|h|})$. We say that a set $X \subseteq \{0, \ldots, |h|\}$ is
coherent (with respect to $\sigma$) if $\sigma_i = \sigma_j$ for $i, j \in X$. If $X$ is coherent,
then the common value of all $\sigma_i$ with $i \in X$ is called $X$’s selected value (with respect
to $\sigma$), and it is denoted $\sigma[X]$.

Note that, if $C(t)$ is coherent, then the value of $t$ in the current state $qh$ is known: it is
just $C(t)$’s selected value. On the other hand, if candidate set $C(t)$ is incoherent, then if $t$
is executed, at least one element of $C(t)$ will be eliminated.
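
Coherence, selected values, and the effect of executing a test on its candidate set can be sketched in a few lines. The encoding is an assumption made for illustration: $\sigma$ is a tuple indexed by $0, \ldots, |h|$, candidate sets are Python sets of indices, and the update keeps exactly the indices whose output under $h$ agrees with the value observed for the test.

```python
from typing import Sequence, Set


def sigma_inverse(sigma: Sequence[int], x: int) -> Set[int]:
    """Indices i with sigma_i = x."""
    return {i for i, v in enumerate(sigma) if v == x}


def is_coherent(X: Set[int], sigma: Sequence[int]) -> bool:
    """X is coherent when sigma takes a single value on X."""
    return len({sigma[i] for i in X}) <= 1


def selected_value(X: Set[int], sigma: Sequence[int]) -> int:
    """The common value sigma[X] of a non-empty coherent set."""
    values = {sigma[i] for i in X}
    if len(values) != 1:
        raise ValueError("selected value is only defined for non-empty coherent sets")
    return values.pop()


def update(C: Set[int], sigma: Sequence[int], observed: int) -> Set[int]:
    """Keep only the prefixes of h whose output matches the value observed for the test."""
    return C & sigma_inverse(sigma, observed)


if __name__ == "__main__":
    sigma = (0, 1, 0, 1)
    C = {0, 1, 2}
    print(is_coherent(C, sigma), update(C, sigma, 1))
```

If a candidate set is incoherent, it contains indices with both output values, so whichever value the test produces, the update removes at least one index, as claimed in the paragraph above.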

What’s more, if $i$ is eliminated from $C(t)$, then every other index $j$ for which
$h_i \equiv h_j$ is also removed since the two tests have the same value in every state. That is,
$|\{[h_i] : i \in C(t)\}|$ decreases by at least one. Thus, $C(t)$ can be reduced in this fashion at
most $D-1$ times.

Also, if we find for tests $t_1$ and $t_2$ that $C(t_1)$ and $C(t_2)$ are disjoint, then $t_1$ and
$t_2$ cannot possibly belong to the same equivalence class. Moreover, if for any $a \in A$ we find
that $C(at_1)$ and $C(at_2)$ are disjoint, then $at_1 \not\equiv at_2$ and therefore
$t_1 \not\equiv t_2$. This is the primary technique used by our procedure for determining
inequivalence of tests (and thus for the elimination of tests from $r(x)$).

Our algorithm maintains a candidate set for each $t \in T$. If all of these candidate sets are
coherent (after $h$ has been executed from some state $q$), then the value of every test
$t \in T$ is known in the current state $qh$; these values are used then to determine the function
$\omega$ in the conjectured automaton. Specifically, if all the candidate sets for the tests in $T$
are coherent, then a conjecture may be made in which $V$, $v_0$ and $\Gamma$ are as described
above, and $\omega(t)$ is taken to be the selected value of $C(t)$ (which is, from the preceding
remarks, the value of $t$ in the current state).

We describe next how a counterexample $z$ to such a conjecture $\mathcal{S}$ is handled. The
technique is similar to that described in Section 5-4.5. Let $z = p_is_i$ where $|p_i| = i$ for
$0 \le i \le |z|$. Let $t_i = \Gamma(\lambda, s_i)$. Finally, let $u_i = p_it_i$. We maintain
henceforth a candidate set for each test $u_i$. Our hope is that these candidate sets will be
reduced to the point that, for some $i$, $C(u_i) \cap C(u_{i+1}) = \emptyset$. For if this happens,
then we can conclude that $u_i \not\equiv u_{i+1}$. Noting that $u_i = p_it_i$ and
$u_{i+1} = p_ibt_{i+1}$ where $b$ is the last symbol of $p_{i+1}$, this implies that
$t_i \not\equiv bt_{i+1}$. Since $t_i = \Gamma(t_{i+1}, b) \in r(bt_{i+1})$, it follows that $t_i$
can be deleted from $r(bt_{i+1})$ as desired.

We show that $C(u_0)$ and $C(u_{|z|})$ are disjoint. Thus, if the candidate sets $C(u_i)$ can be
sufficiently reduced, eventually two consecutive sets $C(u_i)$ and $C(u_{i+1})$ will be made
disjoint as needed. Note that the conjectured automaton $\mathcal{S}$ predicted that the value of
$z$ in the current state $qh$ is $\omega(\Gamma(\lambda, z)) = \omega(t_0)$. Assume this value is 0.
Then, by $\omega$’s definition, $C(u_0) = C(t_0) \subseteq \sigma^{-1}(0)$, where
$\sigma^{-1}(x) = \{0 \le i \le |h| : \sigma_i = x\}$. On the other hand, since $z$ is a
counterexample, $\gamma(qhz) = 1$. Thus, if $z$ is executed from the current state, then
$C(u_{|z|}) = C(z)$ will be included in $\sigma^{-1}(1)$, and, as claimed,
$C(u_0) \cap C(u_{|z|})$ will be empty.
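
The construction of the tests $u_0, \ldots, u_{|z|}$ from a counterexample can be sketched as follows. The sketch assumes that $\Gamma(\lambda, s)$ for a string $s$ is computed by the recurrence $t_{|z|} = \lambda$ and $t_i = \Gamma(t_{i+1}, b)$ with $b$ the first symbol of $s_i$, which is the relation used in the text; the tuple encoding of tests and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Test = Tuple[str, ...]


def counterexample_tests(z: Test, Gamma: Dict[Tuple[Test, str], Test]) -> List[Test]:
    """Return [u_0, ..., u_|z|] with u_i = p_i t_i and t_i = Gamma(t_{i+1}, z[i])."""
    t: List[Test] = [()] * (len(z) + 1)      # t[|z|] = lambda
    for i in range(len(z) - 1, -1, -1):
        t[i] = Gamma[(t[i + 1], z[i])]       # z[i] is the first symbol of s_i
    return [z[:i] + t[i] for i in range(len(z) + 1)]


def first_disjoint_pair(K: Sequence[Set[int]]) -> Optional[int]:
    """Smallest j with K[j] and K[j+1] disjoint, or None if no such pair exists yet."""
    for j in range(len(K) - 1):
        if not (K[j] & K[j + 1]):
            return j
    return None


if __name__ == "__main__":
    # Hypothetical (incorrect) update function with a single variable lambda.
    wrong_gamma = {((), "x"): ()}
    print(counterexample_tests(("x", "x"), wrong_gamma))
    print(first_disjoint_pair([{0, 1}, {1, 2}, {3}]))
```

By construction $u_0 = t_0$ is a variable of the conjecture and $u_{|z|} = z$, so the two ends of the sequence are exactly the tests whose candidate sets the argument above shows to be disjoint.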

Unfortunately, to continually reduce the sets $C(u_i)$, these sets must continually be found
incoherent. This may be a problem because they may very well all be found to be coherent
without any consecutive pair being disjoint. To handle this situation, our algorithm makes a
new conjecture that leaves $V$, $v_0$ and $\Gamma$ alone, but which chooses $\omega$ appropriately
as described above. This gives a new sequence of tests $u_i$ for which candidate sets must also be
maintained. We show below that no more than $D-1$ such sequences need ever be started by
the algorithm before one of the sets $r(x)$ is reduced.

The complete algorithm is shown in Figure 8. In the figure, when a counterexample $z$ is
received, a sequence of tests $u_{\ell 0}, \ldots, u_{\ell m_\ell}$ is constructed as described above;
variable $\ell$ counts the number of such counterexamples received for the same choice of
$\Gamma$. The set $K(i,j)$ is a candidate set for test $u_{ij}$. Note that the same test may have
several candidate sets, not necessarily the same: even if $u_{ij} = u_{i'j'}$, it may be that
$K(i,j) \neq K(i',j')$ if $(i,j) \neq (i',j')$. Although this may seem inefficient, it appears to be
necessary for proving the algorithm’s correctness.

Also, candidate sets are updated in the obvious way: if $t \in T$ is executed leading to a state
outputting the value $x$, then $C(t) \leftarrow C(t) \cap \sigma^{-1}(x)$ (and similarly for sets
$K(i,j)$). Note that only the specified candidate set is modified.

Theorem 5.1 The algorithm described in Figure 8 halts in polynomial time after executing at
most

$$O\bigl(kmD^4(|h| + D + m)\bigr)$$

actions, and outputs a perfect model.

Proof:  If  the  algorithm  halts,  then  it  outputs  a  perfect  model.  Therefore,  it  suffices  to  prove 
that  it  halts  having  executed  only  the  stated  number  of  actions. 

Most of the arguments needed to prove this theorem were given above. Here, we try to pull
those arguments together, filling in missing details. Below, we say that a property holds on
each iteration if it holds between each iteration of the main loop.

Input:  access to $\mathcal{E}$, a finite-state automaton
        $h$ - a diversity-based homing sequence for $\mathcal{E}$
Output: a perfect model of $\mathcal{E}$
Procedure:
 1  $T \leftarrow \{\lambda\}$; $C(\lambda) \leftarrow \{0, \ldots, |h|\}$
 2  $r(b) \leftarrow T$, $\Gamma(\lambda, b) \leftarrow \lambda$ for $b \in B$
 3  $\ell \leftarrow 0$
 4  repeat
 5      execute $h$, producing output $\sigma$
 6      if $C(t)$ is incoherent for some $t \in T$ then
 7          execute $t$ and update $C(t)$
 8      else if $K(i,j)$ is incoherent for some $1 \le i \le \ell$, $0 \le j \le m_i$ then
 9          choose the smallest $i$ for which $K(i,j)$ is incoherent for some $0 \le j \le m_i$
10          execute $u_{ij}$ and update $K(i,j)$
11      else
12          $\omega(t) \leftarrow \sigma[C(t)]$ for $t \in T$
13          conjecture $\mathcal{S} = (T, B, \Gamma, \lambda, \omega)$
14          if $\mathcal{S}$ is a perfect model then
15              stop and output $\mathcal{S}$
16          else
17              obtain counterexample $z$
18              $\ell \leftarrow \ell + 1$; $m_\ell \leftarrow |z|$
19              for $0 \le j \le m_\ell$:
20                  $u_{\ell j} \leftarrow p_j \cdot \Gamma(\lambda, s_j)$ where $z = p_js_j$ and $|p_j| = j$
21                  $K(\ell, j) \leftarrow \{0, \ldots, |h|\}$
22              $K(\ell, 0) \leftarrow \sigma^{-1}(\omega(u_{\ell 0}))$
23              execute $z = u_{\ell m_\ell}$ and update $K(\ell, m_\ell)$
24      if $K(i,j) \cap K(i,j+1) = \emptyset$ for some $1 \le i \le \ell$, $0 \le j < m_i$ then
25          $x \leftarrow b_0t_0$ where $u_{i,j+1} = pb_0t_0$, $|p| = j$ and $b_0 \in B$  [this implies $u_{ij} = p \cdot \Gamma(t_0, b_0)$]
26          $r(x) \leftarrow r(x) - \{\Gamma(t_0, b_0)\}$
27          if $r(x) = \emptyset$ then
28              $r(t) \leftarrow r(t) \cup \{x\}$ for $t \in BT - T$
29              $T \leftarrow T \cup \{x\}$; $C(x) \leftarrow \{0, \ldots, |h|\}$
30              $r(bx) \leftarrow T$ for $b \in B$
31          $\Gamma(t, b) \leftarrow$ any member of $r(bt)$ for $b \in B$, $t \in T$
32          $\ell \leftarrow 0$
33  end

Figure 8: A diversity-based algorithm for inferring $\mathcal{E}$ given a diversity-based homing sequence.

First, on each iteration, if $i \notin C(t)$ then $h_i \not\equiv ht$ for any $t$. This follows from
the manner in which $C$ is updated. Also, the contrapositive implies that $C(t)$ is non-empty on
each iteration since $h_i \equiv ht$ for some $i$ since $h$ is a diversity-based homing sequence.
These statements hold also for candidate sets $K(i,j)$. (At line 22, this follows from the fact
that, by $\omega$’s definition, $C(u_{\ell 0}) \subseteq \sigma^{-1}(\omega(u_{\ell 0}))$.)

These facts imply that, on each iteration, $t \notin r(x)$ only if $x \not\equiv t$, for $t \in T$,
$x \in BT$. Note also that $r(x)$ is non-empty on each iteration.

Thus, if the last element of $r(x)$ is eliminated, then $x \not\equiv t$ for all $t \in T$, and so it
follows that the tests in $T$ are pairwise independent. Thus, by definition of diversity,
$|T| \le D$ on each iteration, and so lines 25-32 are executed at most $(k+1)D^2$ times; in
particular, this implies that $\ell$ is reset to zero at most this many times.

We say a set $x$ respects set $y$ if either $x \subseteq y$ or $x \cap y = \emptyset$.

By definition of equivalence, and also because of the manner in which $C$ is updated, the set
$\{0 \le i \le |h| : h_i \equiv x\}$ respects $C(t)$ for any tests $t$ and $x$, on each iteration.
Thus, $C(t)$ can be reduced in size at most $D-1$ times (and similarly for $K(i,j)$). Combined
with the fact that $|T| \le D$, this implies that the condition at line 6 is satisfied at most
$D(D-1)$ times.

We will show that $\ell \le D-1$ on each iteration. This will complete the theorem: Since
$\ell$ is reset to zero at most $(k+1)D^2$ times, the condition at line 8 can be satisfied at most
$(k+1)mD^2(D-1)^2$ times. Also, since $\ell$ is incremented at line 18, the conditions at lines 6
and 8 can fail to be satisfied at most $(k+1)D^2(D-1)$ times. This gives us an overall bound
on the total number of iterations of the main loop, and, since at most $|h| + m + D - 1$ actions
are executed on each iteration, the result follows.
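
The arithmetic behind the stated bound can be written out as a short worked sum. It uses only the three counts from the paragraph above, takes the bound $\ell \le D-1$ (proved in the remainder of the argument) for granted, and assumes $k, m \ge 1$ so that the middle term dominates.

```latex
\begin{align*}
\#\text{iterations}
 &\le \underbrace{D(D-1)}_{\text{line 6 holds}}
   + \underbrace{(k+1)\,m\,D^{2}(D-1)^{2}}_{\text{line 8 holds}}
   + \underbrace{(k+1)\,D^{2}(D-1)}_{\text{lines 6 and 8 both fail}}
   = O(kmD^{4}), \\
\#\text{actions}
 &\le O(kmD^{4}) \cdot (|h| + m + D - 1)
   = O\bigl(kmD^{4}(|h| + D + m)\bigr).
\end{align*}
```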

Thus, to complete the proof, we show that $\ell \le D-1$ on each iteration.

First, note that on each iteration, $K(i,j) \cap K(i,j+1) \neq \emptyset$ for $1 \le i \le \ell$ and
$0 \le j < m_i$. Also, because the smallest $i$ is chosen at line 9, the set $K(i,j)$ is reduced by
the algorithm only when $K(i',j')$ is coherent for all $1 \le i' < i$ and $0 \le j' \le m_{i'}$. Also,
a counterexample is obtained only when every set $K(i,j)$ is coherent. Thus, it follows that on
each iteration $K(i',j')$ respects $K(i,j)$ for $i' < i$. (When $K(i,j)$ is reduced, either all or
none of the elements of $K(i',j')$ are deleted.)

To prove $\ell \le D-1$, we define a sequence of undirected graphs $G_0, \ldots, G_\ell$. The
vertex set of each graph is the set $\{0, \ldots, |h|\}$. In $G_i$, an edge connects two vertices
$r$ and $s$ if and only if $h_r \equiv h_s$ or $\{r,s\} \subseteq K(i',j)$ for some $1 \le i' \le i$,
$0 \le j \le m_{i'}$.

We will be interested in counting the number of connected components of each graph $G_i$.
First, note that $G_0$ has at most $D$ connected components by definition of diversity. We will
show that each graph $G_{i-1}$ has at least one more connected component than $G_i$. Since every
(non-empty) graph has at least one connected component, this implies that $\ell \le D-1$.

Since the edge set of $G_{i-1}$ is a subset of the edge set of $G_i$, it suffices to find a single
pair of vertices which are connected in $G_i$, but not in $G_{i-1}$.

As argued above in discussing the handling of counterexamples, the sets $K(i,0)$ and
$K(i,m_i)$ are disjoint. Let $r$ and $s$ be respective members of these sets. Then $r$ and $s$ are
connected in $G_i$ because, as remarked above, $K(i,j) \cap K(i,j+1) \neq \emptyset$ on each
iteration, for $0 \le j < m_i$.

We claim that $r$ and $s$ are not connected in $G_{i-1}$. For if they were, then since $r$ but
not $s$ is contained in $K(i,0)$, there must be adjacent vertices $r'$ and $s'$ on the path from
$r$ to $s$ for which $r'$ but not $s'$ is contained in $K(i,0)$. Since $r'$ and $s'$ are adjacent,
either $h_{r'} \equiv h_{s'}$ or $\{r',s'\} \subseteq K(i',j)$ for some $i' < i$. However, as already
argued, either case implies that $\{r',s'\}$ respects $K(i,0)$, a contradiction.

This  completes  the  proof.  ■ 
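
The counting argument above can be illustrated with a small union-find sketch. The vertex count, the edge lists and the helper name `count_components` are hypothetical and chosen only for illustration; the point is simply that if each graph in a nested sequence merges at least one pair of previously unconnected vertices, the number of components drops by at least one per step.

```python
def count_components(num_vertices, edges):
    """Count connected components of an undirected graph using union-find."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    components = num_vertices
    for r, s in edges:
        root_r, root_s = find(r), find(s)
        if root_r != root_s:
            parent[root_r] = root_s
            components -= 1
    return components


# Hypothetical nested edge sets: every edge of G_0 is in G_1, and every edge of G_1 is in G_2.
g0 = [(0, 1), (2, 3)]
g1 = g0 + [(1, 2)]
g2 = g1 + [(3, 4)]
counts = [count_components(6, g) for g in (g0, g1, g2)]
```

Here `counts` decreases by one at each step, so a sequence that starts with at most D components can contain at most D - 1 such steps.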

So  Theorem  5.1  shows  that  an  effective  diversity-based  algorithm  exists  for  inferring  a  finite- 
state  environment,  assuming  a  diversity-based  homing  sequence  has  been  provided.  We  turn 
next  to  the  problem  of  extending  this  algorithm  to  handle  environments  when  such  a  sequence 
is  not  available. 

5-5.2  Constructing  a  homing  sequence 

As in the algorithm of Figure 6, we presume that some sequence $h$ is a true diversity-based
homing sequence until it becomes necessary to extend and improve $h$. Initially, $h = \lambda$. If
for some test $x$, candidate set $C(x)$ is reduced to the empty set, then clearly $h$ cannot be a
diversity-based homing sequence since this implies that $hx$ is inequivalent to every prefix of $h$.
We therefore replace $h$ with $hx$ as is done in the algorithm of Figure 4. Since more equivalence
classes are represented by the prefixes of $hx$ than by those of $h$, it follows that $h$ must converge
to a correct homing sequence if extended in this fashion at most $D - 1$ times.

Our  extended  algorithm  is  quite  similar  to  the  one  given  in  Figure  8.  As  before,  we  maintain 
a  set  T  and  function  r,  which  together  record  inequivalences  determined  among  the  tests.  Now, 
however,  the  problem  of  determining  that  two  tests  are  inequivalent  becomes  more  difficult:  we 
saw earlier that if $h$ is a diversity-based homing sequence and $C(x) \cap C(y) = \emptyset$ for two tests $x$
and $y$, then $x \not\equiv y$. However, if $h$ is not a diversity-based homing sequence, this conclusion may
be false: it may be that $x$ and $y$ are equivalent to one another, but not to any prefix of $h$.

Nevertheless,  we  show  that  if  x  and  y  are  in  fact  equivalent,  then  by  re-running  these  tests 
repeatedly  in  an  appropriate  manner,  we  can  with  high  probability  eliminate  all  the  elements 
of  one  of  the  candidate  sets,  thus  yielding  an  extension  to  h  as  described  above. 

Suppose that, having executed $h$, we find that $C(x)$ and $C(y)$ are coherent, and furthermore,
that their selected values are different. If $x \equiv y$ then, by definition of equivalence, the true
values  of  the  two  tests  in  the  current  state  are  equal.  Thus,  the  selected  value  of  one  of  the 

Input:  access to $\mathcal{E}$, a finite-state automaton
        $D$ - the diversity of $\mathcal{E}$
        $\delta$ - desired confidence
Output: a perfect model of $\mathcal{E}$
Procedure:

1   $h \leftarrow \lambda$
2   $\delta_0 \leftarrow \delta / ((D-1)((k+1)D^2 + D - 1))$
3   initialize T, r, T, C and $\ell$ as in Figure 8 (lines 1-3)
4   repeat
5       execute $h$, producing output $\sigma$
6       if $C(t)$ is incoherent for some $t \in T$ then
7           execute $t$ and update $C(t)$
8       else if $\bigcup_{j=0}^{m_i} K(i,j)$ is incoherent for some $1 \le i \le \ell$ then
9           choose the smallest $i$ for which this is so
10          if $K(i,j)$ is incoherent for some $0 \le j \le m_i$ then
11              execute $u_{ij}$ and update $K(i,j)$
12          else [$m_i = 1$, $K(i,0) \cap K(i,1) = \emptyset$ and $\sigma[K(i,0)] \ne \sigma[K(i,1)]$]
13              choose $j \in \{0,1\}$ randomly
14              execute $u_{ij}$ and update $K(i,j)$
15              if $K(i,j) \ne \emptyset$ then
16                  $s_i \leftarrow s_i + 1$
17                  if $s_i > \lg(1/\delta_0)$ then
18                      conclude $u_{i0} \not\equiv u_{i1}$: update r, T, C, T as in Figure 8 (lines 25-31)
19                      $\ell \leftarrow 0$
20      else
21          make conjecture: handle returned counterexample as in Figure 8 (lines 12-23)
22          $s_\ell \leftarrow -1$
23      if $K(i,j) = \emptyset$ for some $1 \le i \le \ell$, $0 \le j \le m_i$ then
24          $h \leftarrow h u_{ij}$
25          $C(t) \leftarrow \{0, \ldots, |h|\}$ for $t \in T$
26          $\ell \leftarrow 0$
27      else if $K(i,j) \cap K(i,j+1) = \emptyset$ and $s_i < 0$ for some $1 \le i \le \ell$, $0 \le j < m_i$ then
28          $s_i \leftarrow 0$
29          $u_{i0} \leftarrow u_{ij}$; $u_{i1} \leftarrow u_{i,j+1}$; $m_i \leftarrow 1$
30  end


Figure 9: A diversity-based algorithm for inferring $\mathcal{E}$.

candidate  sets  must  disagree  with  the  common  value  of  the  two  tests  in  the  current  state.  If  this 
is  the  case  for  x  (say),  and  x  is  executed,  then  C(x)  will  be  reduced  to  the  empty  set  (and  hx 
can  replace  h).  In  general,  if  the  algorithm  randomly  chooses  which  of  x  or  y  to  execute,  then 
with  probability  1/2,  the  candidate  set  of  the  chosen  test  is  emptied.  Of  course,  by  repeating 
such  an  experiment  many  times,  we  can  lift  our  confidence  to  arbitrarily  high  levels. 
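
As a rough illustration of how repetition drives up confidence, the sketch below computes how many reruns are needed before the chance of a pair of equivalent tests surviving every rerun falls below a target, and checks this by simulation. The function names, the seed and the trial count are assumptions made for the example; each rerun is modelled as an independent fair coin, following the probability 1/2 stated above.

```python
import math
import random


def reruns_needed(delta0):
    """Smallest integer s such that 2**(-s) <= delta0."""
    return math.ceil(math.log2(1.0 / delta0))


def survives(num_reruns, rng):
    """Simulate rerunning two equivalent tests with different selected values.

    Each rerun empties the chosen candidate set with probability 1/2.
    Returns True if no rerun empties a candidate set.
    """
    return all(rng.random() >= 0.5 for _ in range(num_reruns))


rng = random.Random(0)          # fixed seed, chosen arbitrarily
threshold = reruns_needed(0.01)
trials = 20000
fraction = sum(survives(threshold, rng) for _ in range(trials)) / trials
```

With a target of 0.01 the threshold is 7 reruns, and the simulated survival fraction should be close to 2^-7.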

This then is the approach used by our extended algorithm (Figure 9) in determining test
inequivalence.  The  algorithm  proceeds  just  as  before.  Now,  however,  when  the  candidate  sets 
of  two  tests  are  found  to  be  disjoint  (line  23),  the  procedure  does  not  immediately  conclude 
that  the  tests  are  inequivalent.  Rather,  it  keeps  the  two  candidate  sets  around  and,  when  given 
the  opportunity,  re-runs  the  two  tests  as  described  above.  Only  after  the  tests  have  been  re-run 
many  times  with  neither  of  the  candidate  sets  emptying  does  the  algorithm  conclude  that  the 
tests  are  inequivalent. 

Theorem 5.2 The algorithm of Figure 9, with probability at least $1 - \delta$, halts and outputs
a perfect model. The algorithm's running time, and the number of actions executed, are both
bounded by

$O(kD^4(m + D)(mD + \log(kD/\delta)))$.

Proof:  The  proof  of  this  theorem  is  quite  similar  to  the  proof  of  Theorem  5.1.  As  before,  we 
need  only  show  that  the  algorithm  halts  in  the  stated  number  of  steps  since  it  only  halts  when 
a  perfect  model  has  been  found. 

As before, $i \notin C(t)$ only if $h_i \not\equiv ht$ for any test $t$ and $0 \le i \le |h|$. Similarly for $K(i,j)$.

In the algorithm, the variable $s_i$ serves two purposes: When the sequence of tests $u_{i0}, \ldots, u_{im_i}$
is first created (lines 21-22), $s_i$ is set to $-1$. Variable $s_i$ remains negative until the candidate
sets of two consecutive tests in this sequence are reduced to the point that they are disjoint.
At this point, $s_i$ becomes a (non-negative) counter indicating how many times the tests $u_{i0}$ and
$u_{i1}$ have been rerun without either candidate set emptying. It is easily verified that, on each
iteration, $K(i,j) \cap K(i,j+1) \ne \emptyset$ for $0 \le j < m_i$ if $s_i < 0$, and $m_i = 1$ and $K(i,0) \cap K(i,1) = \emptyset$
if $s_i \ge 0$. Note that this implies at line 8 that $\bigcup_j K(i,j)$ is coherent if and only if every $K(i,j)$
is coherent and the selected values of $K(i,0)$ and $K(i,1)$ agree.

The algorithm is randomized, and can only be shown to behave correctly when certain low
probability events do not occur. Therefore, to simplify the analysis, we will assume that the
algorithm has a good run: specifically, that if $u_{i0} \equiv u_{i1}$ then $s_i$ does not exceed $\lg(1/\delta_0)$, for
$1 \le i \le \ell$. Later, we will show that a good run occurs with probability at least $1 - \delta$.

Assuming then that a good run occurs, it is clear that $t \in r(x)$ only if $x \not\equiv t$ for $t \in T$,
$x \in BT$. Thus, all pairs of tests in $T$ are inequivalent, and $|T| \le D$. Further, this shows that
lines 18-19 are executed no more than $(k+1)D^2$ times.

As argued above, $h$ cannot be extended more than $D - 1$ times, implying that lines 24-26 are
executed at most $D - 1$ times. Thus, variable $\ell$ is reset to zero no more than $(k+1)D^2 + D - 1$
times. Later we will again argue that $\ell \le D - 1$ on each iteration; assume for now that
this is the case. Then since $s_i \le \lg(1/\delta_0)$ on each iteration, lines 13-19 are executed at most
$(D-1)((k+1)D^2 + D - 1)\lg(1/\delta_0)$ times.

It can be verified, as in Theorem 5.1, that the set $\{0 \le i \le |h| : h_i \equiv x\}$ respects $C(t)$ for any
tests $t$ and $x$, on each iteration (and likewise for $K(i,j)$). Applying our bound on $\ell$ and on the
number of times $\ell$ is reset, this implies line 11 is executed at most $(D-1)^2 m((k+1)D^2 + D - 1)$
times, and so the condition at line 8 is satisfied at most
$(D-1)((k+1)D^2 + D - 1)((D-1)m + \lg(1/\delta_0))$ times.

The sets $C(t)$ for $t \in T$ are reset to $\{0, \ldots, |h|\}$ at most $D - 1$ times (i.e., only when $h$ is
extended). Thus, the condition at line 6 is satisfied at most $D(D-1)^2$ times.

Finally, since $\ell$ is bounded and is incremented at line 21, the conditions at lines 6 and 8 fail to
be satisfied at most $(D-1)((k+1)D^2 + D - 1)$ times. Thus, the number of iterations of the
outer loop can be computed, and the bound on the number of actions executed follows from
the fact that $|h| \le (D-1)(m + D - 1)$.

The proof that $\ell \le D - 1$ is quite similar to that given in the proof of Theorem 5.1. As
before, we define graphs $G_0, \ldots, G_\ell$ on vertex set $\{0, \ldots, |h|\}$. We let $\{r, s\}$ be an edge of $G_i$
if and only if $h_r \equiv h_s$ or $\{r, s\} \subseteq K(i',j)$ for some $1 \le i' \le i$. Then $G_0$ has at most $D$
connected components. It can be argued as before that $K(i',j')$ respects $K(i,j)$ if $i' < i$.

Also, $K(i,0)$ and $K(i,m_i)$ are disjoint. Therefore, if $r$ is in $K(i,0)$ and $s$ is in $K(i,m_i)$ then $r$
and $s$ are connected in $G_i$ but not in $G_{i-1}$ by the argument given in the proof of Theorem 5.1.
Thus, $G_{i-1}$ has at least one more connected component than $G_i$, and $\ell \le D - 1$.

Thus, we have proved that the stated bound on the number of actions executed holds on a
good run. It remains then only to show that a good run occurs with probability at least $1 - \delta$.

As argued above, if $K(i,0)$ and $K(i,1)$ are coherent with different selected values, and if
$u_{i0} \equiv u_{i1}$, then the probability is 1/2 that $K(i,j)$ is empty after $u_{ij}$ is executed, for $j$ chosen
randomly from $\{0,1\}$. Thus, if $u_{i0} \equiv u_{i1}$, then the probability that $s_i$ exceeds $k$ is less than
$2^{-k}$. In particular, $s_i$ exceeds $\lg(1/\delta_0)$ with probability less than $\delta_0$.

We argued above that, on a good run, lines 21-22 are executed at most $(D-1)((k+1)D^2 +
D - 1)$ times; that is, at most this many pairs $u_{i0}, u_{i1}$ are created. The chance that $s_i$ exceeds
$\lg(1/\delta_0)$ when $u_{i0} \equiv u_{i1}$ for any of these pairs is thus bounded by $\delta$. Thus, $\delta$ bounds the
probability of a bad run, completing the proof of the action execution bound.
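
The union bound behind this last step can be written out explicitly. The following is only a restatement of the argument above, using the value of $\delta_0$ set at line 2 of Figure 9; the symbol $N$ is introduced here for brevity and does not appear in the original.

```latex
N = (D-1)\bigl((k+1)D^2 + D - 1\bigr), \qquad \delta_0 = \frac{\delta}{N},
\qquad
\Pr[\text{bad run}]
  \;\le\; \sum_{\text{pairs } u_{i0} \equiv u_{i1}} \Pr\bigl[s_i > \lg(1/\delta_0)\bigr]
  \;<\; N \cdot 2^{-\lg(1/\delta_0)}
  \;=\; N \delta_0
  \;=\; \delta .
```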

It  is  clear  that  this  algorithm  runs  in  polynomial  time.  It  is  not  so  obvious,  however,  how 
it  can  be  implemented  to  run  in  time  proportional  to  the  bound  on  the  number  of  actions 
executed.  We  discuss  techniques  that  can  be  used  to  achieve  such  a  time  bound. 

Perhaps the most time consuming task performed by the algorithm is in checking the coherence
of the many candidate sets. In a naive implementation, determining the coherence of
a subset of $\{0, \ldots, |h|\}$ takes $O(|h|)$ time. Thus, for instance, checking the $|T|$ candidate sets
$C(t)$ at line 6 takes up to $O(D|h|)$ time; since only $O(|h| + D + m)$ actions are executed on each
iteration, this gives a time bound that exceeds the action execution bound by at least a factor
of $D$.

We show instead how the coherence of any candidate set can be checked in $O(D)$ time
using a different representation: We maintain a partition $\pi$ of the set $\{0, \ldots, |h|\}$ with the
interpretation that $i$ and $j$ are in the same block of $\pi$ if and only if every time that $h$ was
previously executed (line 5), it was observed that $\sigma_i = \sigma_j$ (where $\sigma$ was the observed output
sequence, as usual). In particular, if $h_i \equiv h_j$, then $i$ and $j$ are always in the same block of $\pi$.
Thus $|\pi| \le D$.

Note that if such a partition is maintained, then on each iteration, each block of $\pi$ respects
each candidate set $C(t)$ or $K(i,j)$. It therefore makes sense to represent each candidate set as
a set of pointers to the blocks of $\pi$ that it includes. If this is done, then each candidate set
contains at most $D$ pointers, and each set's coherence can be determined in $O(D)$ time (it is
only necessary to examine the value of one member of each block since all the other members
have the same value).

It is quite easy to see how the partition $\pi$ can be maintained: Initially, and each time $h$ is
extended, $\pi$ is set to $\{\{0, \ldots, |h|\}\}$. After $h$ is executed with output $\sigma$ at line 5, the coherence
of each block of $\pi$ is determined. Since each index $0, \ldots, |h|$ occurs in only one block of $\pi$,
this only takes $O(|h|)$ time. If any block $s$ is incoherent, then it is split into two new blocks
$s \cap \sigma^{-1}(0)$ and $s \cap \sigma^{-1}(1)$. Since $|\pi| \le D$, this can happen at most $D - 1$ times. Naturally,
when it does happen, all of the candidate sets must be changed so that their members point
to blocks of the new partition. This takes $O(D)$ time for each of the $O(mD)$ candidate sets.
Thus, since $h$ can be extended at most $D - 1$ times, the algorithm spends at most $O(mD^4)$
time updating candidate sets in this fashion. (This time is negligible compared to the number
of actions executed.)
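
The partition maintenance just described can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names and the list-of-lists representation are assumptions, candidate sets are stored as sets of block positions, and the step that re-points candidate sets after a split is omitted.

```python
def refine(partition, sigma):
    """Split every incoherent block according to the observed output bits.

    partition: list of blocks, each a list of indices into sigma.
    sigma: observed 0/1 outputs, one per index.
    Returns the refined partition and the number of blocks that were split.
    """
    refined, splits = [], 0
    for block in partition:
        zeros = [i for i in block if sigma[i] == 0]
        ones = [i for i in block if sigma[i] == 1]
        if zeros and ones:
            refined.extend([zeros, ones])
            splits += 1
        else:
            refined.append(block)
    return refined, splits


def is_coherent(candidate_blocks, partition, sigma):
    """Check coherence using one representative per block.

    Valid only when partition has already been refined against sigma,
    so that every block is itself coherent.
    """
    values = {sigma[partition[b][0]] for b in candidate_blocks}
    return len(values) <= 1


partition = [[0, 1, 2, 3, 4]]
partition, first_splits = refine(partition, [0, 1, 0, 1, 0])
sigma = [0, 1, 1, 1, 0]
partition, second_splits = refine(partition, sigma)
```

Because only one representative per block is inspected, the coherence check costs time proportional to the number of blocks in the candidate set rather than to the length of the homing sequence.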

This still does not give the desired time bound because, even with this modification, naively
computing $\ell$ unions, each of up to $m$ candidate sets as at line 8, can take $O(mD^2)$ time.
Instead of the naive approach, we therefore maintain a counter $c(i,s)$ for each $1 \le i \le \ell$
and $s \in \pi$. This counter indicates the number of candidate sets $K(i,j)$ which include $s$:
$c(i,s) = |\{0 \le j \le m_i : s \subseteq K(i,j)\}|$. It is straightforward how such a counter can be efficiently
maintained, and the union $\bigcup_j K(i,j)$ can now be easily computed in $O(D)$ time as the union
of those blocks $s \in \pi$ for which $c(i,s) > 0$.
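
One possible way to keep such counters is sketched below for a single index i. The class name `BlockCounter` and the use of integer block identifiers are assumptions made for illustration; the sketch only shows that the union can be read off the counters without rescanning every candidate set.

```python
from collections import Counter


class BlockCounter:
    """For one index i, counts how many candidate sets K(i, j) contain each block."""

    def __init__(self):
        self.count = Counter()

    def add(self, block_id):
        self.count[block_id] += 1

    def remove(self, block_id):
        self.count[block_id] -= 1
        if self.count[block_id] == 0:
            del self.count[block_id]

    def union(self):
        """Blocks with a positive count, i.e. the union of all K(i, j)."""
        return set(self.count)


counter = BlockCounter()
for candidate_set in ({0, 1}, {1, 2}):  # hypothetical K(i, 0) and K(i, 1)
    for block_id in candidate_set:
        counter.add(block_id)
before = counter.union()
counter.remove(0)  # block 0 leaves K(i, 0)
counter.remove(1)  # block 1 leaves K(i, 0) but is still in K(i, 1)
after = counter.union()
```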

Finally,  lines  23  and  27,  which  appear  to  require  a  great  deal  of  search,  actually  do  not 
because  only  a  small  number  of  values  i,j  (those  for  which  K(i,j)  was  modified)  need  actually 
be  checked. 

With  these  ideas,  it  can  now  be  fairly  easily  verified  that  the  algorithm  halts  within  the 
stated time bound. ■

5-6  A state-based algorithm for permutation automata

In this and the next sections, we present algorithms for inferring permutation automata. Unlike
the procedures described up to this point, these procedures do not rely on a means of discovering
counterexamples; the procedures actively experiment with the unknown environment, and
output a perfect model with arbitrarily high probability.

As before, we describe both a state-based and a diversity-based procedure. In both cases,
we describe deterministic procedures that, given a (diversity-based) homing sequence $h$, will
output a perfect model of the environment in time polynomial in $n$ (or $D$) and $|h|$. To construct
the needed homing sequence, we show that any sufficiently long random sequence of actions is
likely to be a homing sequence.

We begin in this section with the state-based case. Consider first the simpler problem of
inferring a visible automaton, i.e., one in which the identity of each state is readily observable.
For instance, suppose each state, instead of outputting 0 or 1, outputs its own name. In
this situation, inference of the automaton is almost trivial. From the current state $q$, we can
immediately learn the value of $\delta(q,b)$ by simply executing $b$ and observing the state reached.
If $\delta(q,b)$ is already known for all the basic actions, then either we can find a path based on
what is already known about $\delta$ to a state for which this is not the case, or we have finished
exploring the automaton. It is not hard to see that $O(kn^2)$ actions are executed in total by this
procedure.
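
The visible-automaton procedure can be sketched as below. This is an illustrative reading of the paragraph above, not code from the thesis: the hidden transition table `hidden_delta` stands in for the environment (the learner only consults it by "executing" an action), breadth-first search is one assumed way of finding a path to a state with an unexplored action, and the three-state example automaton is made up.

```python
from collections import deque


def find_path(known, current, actions):
    """Shortest action sequence from current that ends with an unexplored action."""
    queue, seen = deque([(current, [])]), {current}
    while queue:
        state, path = queue.popleft()
        for b in actions:
            if (state, b) not in known:
                return path + [b]
            nxt = known[(state, b)]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [b]))
    return None  # every reachable transition is already known


def explore_visible(hidden_delta, start, actions):
    """Learn the transition table of an automaton whose states are observable."""
    known, current, executed = {}, start, 0
    while True:
        path = find_path(known, current, actions)
        if path is None:
            return known, executed
        for b in path:
            nxt = hidden_delta[(current, b)]  # execute b and observe the state reached
            known[(current, b)] = nxt
            current = nxt
            executed += 1


# Hypothetical 3-state example: "x" is a cyclic shift, "y" swaps states 0 and 1.
hidden_delta = {(q, "x"): (q + 1) % 3 for q in range(3)}
hidden_delta.update({(0, "y"): 1, (1, "y"): 0, (2, "y"): 2})
learned, executed = explore_visible(hidden_delta, 0, ("x", "y"))
```

Each pass through the loop learns at least one new transition and executes at most n actions, which is where the bound of order kn^2 comes from.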

Now suppose that the unknown environment $\mathcal{E}$ is a permutation automaton and that a
homing sequence $h$ has been provided. Because $\mathcal{E}$ is a permutation environment, we can easily
show that $h$ is also a distinguishing sequence, that is, $h$ distinguishes every pair of unequal states
of $\mathcal{E}$. Put another way, $q_1(h) = q_2(h)$ if and only if $q_1 = q_2$. (For if $q_1(h) = q_2(h)$ then, since $h$
is a homing sequence, $q_1 h = q_2 h$. This implies $q_1 = q_2$ since $\mathcal{E}$ is a permutation environment.)
Thus, the identity of any state is uniquely given by the output of $h$ at that state; its identity is
almost directly observable.

To  infer  the  environment,  we  therefore  use  the  inference  procedure  sketched  above  for  visible 
automata.  Each  state  q  is  named  or  represented  by  q(h),  the  output  of  h  at  that  state.  To 
identify  the  current  state,  simply  execute  h  and  observe  the  output  produced. 

Although executing $h$ is helpful in identifying the state from which the sequence was executed,
doing so is also likely to leave us in a state at the end of the sequence whose identity is
unknown. This is a problem because the visible-automaton inference procedure requires that
we be able to find a state whose identity is known even without executing $h$. We can overcome
this problem, however, by maintaining a table $u$ which records the fact that if $\sigma = q(h)$ was
just observed as the output of executing $h$, then the output of $h$ if executed from the current
state $qh$ is given by $u(\sigma)$.

Input:  access to $\mathcal{E}$, a permutation automaton
        $h$ - homing sequence
Output: a perfect model of $\mathcal{E}$

Procedure:

1   $d$, $u$ are initially undefined everywhere
2   execute $h$, producing output $\sigma$
3   repeat
4       if $u(\sigma)$ is not defined then
5           execute $h$, producing output $\tau$
6           $u(\sigma) \leftarrow \tau$
7           $\sigma \leftarrow \tau$
8       else if $(\exists a \in A, b \in B)$ $d(u(\sigma),a)$ is defined, but $d(u(\sigma),ab)$ is undefined then
9           choose the shortest such $ab$
10          $\alpha \leftarrow d(u(\sigma),a)$
11          execute $ab$
12          execute $h$, producing output $\tau$
13          $d(\alpha,b) \leftarrow \tau$
14          $\sigma \leftarrow \tau$
15      else
16          exit loop
17  end
18  let $q$ be the current state
19  output the following prediction rule (model of $\mathcal{E}$):
20      on input $a \in A$,
21          $\alpha \leftarrow d(u(\sigma),a)$
22          predict $\gamma(qa) = \alpha_0$


Figure 10: A state-based algorithm for inferring permutation environment $\mathcal{E}$.

Thus,  we  can  reach  a  state  whose  identity  is  known  (without  executing  h  from  it),  we  can 
execute  an  experiment  as  dictated  by  the  visible-automaton  inference  procedure,  and  we  can 
identify  the  last  state  reached  by  executing  h.  This  can  of  course  be  repeated  as  many  times 
as  necessary. 

Our procedure is given in Figure 10. As mentioned, each state $q$ is represented by $q(h)$, the
output of $h$ at $q$. For $\sigma \in Q(h)$, we write $q_\sigma$ to denote that state for which $q_\sigma(h) = \sigma$. This
state is well-defined since $h$ is a distinguishing sequence. A function or table $u : Q(h) \to Q(h)$
is maintained for which $u(\sigma) = q_\sigma h(h)$. That is, if $h$ was just executed with output $\sigma$, then the
current state is $q_{u(\sigma)}$.

The transition function is represented by the program variable $d : Q(h) \times B \to Q(h)$. For
notational purposes, the function $d$ can be extended in the usual manner to the domain $Q(h) \times A$.
The variable $d$ is used to store and compute the output of $h$ in future states. Given $\sigma \in Q(h)$
and $b \in B$, $d(\sigma,b)$ denotes the output of $h$ in state $q_\sigma b$. That is, if properly constructed,
$d(\sigma, b) = q_\sigma b(h)$.

Theorem 6.1 The algorithm of Figure 10 halts and outputs a perfect model of $\mathcal{E}$ after executing
at most $O(kn(|h| + n))$ actions, and in time $O(kn(|h| + kn))$.

Proof: Clearly, $|Q(h)| \le n$, so that after at most $n + kn$ iterations, the procedure will halt
since every entry of $u$ and $d$ will be defined.

We can view $d$ as defining a directed graph whose vertices are the elements of $Q(h)$, and
whose edges are of the form $\sigma \to d(\sigma, b)$ whenever $\sigma \in Q(h)$, $b \in B$ and $d(\sigma, b)$ is defined. Then
the problem of finding an experiment $ab$ as in the figure can be treated as that of finding a path
in the graph from $u(\sigma)$ to another vertex $\alpha$ whose out-degree is less than $k$. This is easily done
in $O(kn)$ time (for instance, using breadth-first search), and the resulting experiment $ab$ has
length at most $n$, the size of the graph. This proves the upper bound on the number of actions
executed.

The remaining steps of the loop can be achieved in $O(|h|)$ time, for instance, if we store the
elements of $Q(h)$ at the leaves of a depth $(|h| + 1)$ binary tree. It remains then only to show
that the prediction rule output by the algorithm is a perfect model of $\mathcal{E}$.

We  prove  this  by  showing  that  the  following  invariants  hold  between  each  iteration  of  the 
main  loop: 

1.  If $\sigma \in Q(h)$ and $u(\sigma)$ is defined, then $u(\sigma) = q_\sigma h(h)$.

2.  If $\sigma \in Q(h)$, $b \in B$ and $d(\sigma,b)$ is defined, then $d(\sigma,b) = q_\sigma b(h)$.

Initially, these invariants hold vacuously since $u$ and $d$ are undefined everywhere. Suppose at
the top of an iteration of the loop that $h$ was just executed from some state $q$ with output $\sigma$.
Then $q = q_\sigma$, and the current state is $q_\sigma h$. If $u(\sigma)$ is undefined, then $h$ is executed from the
current state with output $\tau$. Thus, we learn that $\tau = q_\sigma h(h)$. Setting $u(\sigma)$ to $\tau$, invariant 1 is
maintained.

On the other hand, if $u(\sigma)$ is defined, then the current state is $q_{u(\sigma)}$. If an experiment $ab$ is
found as shown in the figure, then invariant 2, together with an easy induction argument on the
length of $a$, shows that $\alpha = d(u(\sigma),a) = q_{u(\sigma)}a(h)$. The state we reach by executing $a$ then is
just $q_\alpha$. Executing $b$, and then $h$ with output $\tau$, we learn that $\tau = q_\alpha b(h)$. Setting $d(\alpha,b) = \tau$,
invariant 2 is maintained.

With these invariants, it is not hard to see why, after the loop is exited, the output prediction
rule is correct. The current state $q$ is just $q_{u(\sigma)}$ as before. Given $a \in A$, we have $qa(h) =
d(u(\sigma),a)$. Therefore, the first element of $d(u(\sigma),a)$ is $\gamma(qa)$. ■
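
For concreteness, here is a minimal simulation of the loop of Figure 10. It is a sketch under stated assumptions rather than the thesis's implementation: the `Environment` class, the three-state example automaton, its output function and the homing sequence `("x",)` are all hypothetical; states are named by tuples of observed outputs; and breadth-first search is used to choose the shortest experiment.

```python
from collections import deque
from itertools import product


class Environment:
    """Simulated automaton; the learner interacts with it only through run()."""

    def __init__(self, delta, gamma, state):
        self.delta, self.gamma, self.state = delta, gamma, state

    def run(self, actions):
        """Execute actions; return the outputs seen, starting with the current state's."""
        outputs = [self.gamma[self.state]]
        for b in actions:
            self.state = self.delta[(self.state, b)]
            outputs.append(self.gamma[self.state])
        return tuple(outputs)


def plan(d, start, basic_actions):
    """Find the nearest vertex alpha with an undefined entry d(alpha, b)."""
    queue, seen = deque([(start, ())]), {start}
    while queue:
        vertex, path = queue.popleft()
        for b in basic_actions:
            if (vertex, b) not in d:
                return path, b, vertex
            nxt = d[(vertex, b)]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + (b,)))
    return None


def infer(env, h, basic_actions):
    u, d = {}, {}
    sigma = env.run(h)
    while True:
        if sigma not in u:
            tau = env.run(h)              # learn the output of h from the current state
            u[sigma] = tau
        else:
            experiment = plan(d, u[sigma], basic_actions)
            if experiment is None:
                return u, d, sigma
            a, b, alpha = experiment
            env.run(a + (b,))             # move to the state named alpha, then execute b
            tau = env.run(h)
            d[(alpha, b)] = tau
        sigma = tau


def predict(u, d, sigma, a):
    """Predicted output of the state reached by executing a from the current state."""
    alpha = u[sigma]
    for b in a:
        alpha = d[(alpha, b)]
    return alpha[0]


# Hypothetical permutation automaton: "x" shifts cyclically, "y" swaps states 0 and 1.
delta = {(q, "x"): (q + 1) % 3 for q in range(3)}
delta.update({(0, "y"): 1, (1, "y"): 0, (2, "y"): 2})
gamma = {0: 1, 1: 0, 2: 0}
env = Environment(delta, gamma, state=0)
u, d, sigma = infer(env, h=("x",), basic_actions=("x", "y"))


def true_output(state, a):
    for b in a:
        state = delta[(state, b)]
    return gamma[state]


mismatches = [a for n in range(5) for a in product("xy", repeat=n)
              if predict(u, d, sigma, a) != true_output(env.state, a)]
```

In this example the list `mismatches` is empty: every prediction made from the learned tables `u` and `d` agrees with the simulated environment for all action sequences of length up to four.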

Finally,  we  must  consider  how  to  construct  h.  In  fact,  any  sufficiently  long  random  sequence 
of  actions  is  likely  to  be  a  homing  sequence: 

Theorem 6.2 Let $\delta > 0$, and let $h$ be a random sequence of length $8kn^5 \ln(n) (n + \ln(1/\delta))$.
Then $h$ is a homing sequence with probability at least $1 - \delta$.

Proof:  The  idea  is  to  randomly  construct  the  homing  sequence  in  the  manner  described  in 
Figure  3.  On  each  iteration,  an  appropriate  extension  x  which  distinguishes  some  pair  of  states 
as  needed  by  the  algorithm  is  likely  to  be  given  by  any  sufficiently  long  random  walk.  This 
follows  from  previous  results  on  random  walks  in  permutation  automata.  Specifically,  we  will 
use  the  following  result: 

Lemma 6.3 Let $q_1$ and $q_2$ be two distinct states and let $x$ be a random sequence of length
$2kn^4 \ln(n)$ of the following form: At each step, with equal probability, we either do nothing, or
we execute a uniformly and randomly chosen basic action from $B$. Then the probability that
$\gamma(q_1 x) \ne \gamma(q_2 x)$ is at least $1/(2n)$.

Essentially, the same result is proved in my master's thesis [79] using results of Fiedler [24]
on the eigenvalues of doubly stochastic matrices, in addition to certain properties of point-symmetric
graphs. There, the result was proved for update graphs, but, because of the "dual"
relationship between update graphs and finite automata, the result holds as stated as well.

Let $x_1, \ldots, x_r$ be a sequence of random strings, each of length $2kn^4 \ln(n)$. Let $y_i =
x_1 x_2 \cdots x_i$. We wish to show that $y_r$ is a homing sequence with high probability. Consider
a sequence of trials in which "success" on the $i$th trial means that either $y_{i-1}$ is a homing
sequence (so that $y_i$ is as well) or $|Q(y_i)| > |Q(y_{i-1})|$. Clearly, if $n$ of the trials succeed, then
$y_r$ is a homing sequence.

For any choice of $y_{i-1}$, Lemma 6.3 shows that the probability of success on the $i$th trial is
at least $1/(2n)$. Thus, applying Chernoff bounds (Lemma 2-3.6), we see that the probability of
fewer than $n$ successes in $r$ trials is at most $\delta$ if $r \ge 4n(n + \ln(1/\delta))$. This proves the theorem.

■ 
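
The construction in Theorem 6.2 can be explored on small examples with the sketch below. The helper names, the dictionary representation of the automaton and the three-state example are assumptions made for illustration. A sequence is treated as a homing sequence when equal outputs always imply equal final states, and `lazy_random_sequence` follows the step rule of Lemma 6.3 (do nothing or take a uniformly random basic action, each with probability 1/2).

```python
import math
import random


def output_and_final(delta, gamma, q, h):
    """Outputs observed while executing h from q, and the state reached."""
    outputs = [gamma[q]]
    for b in h:
        q = delta[(q, b)]
        outputs.append(gamma[q])
    return tuple(outputs), q


def is_homing(delta, gamma, states, h):
    """True if the output of h always determines the final state."""
    final_for_output = {}
    for q in states:
        outputs, end = output_and_final(delta, gamma, q, h)
        if final_for_output.setdefault(outputs, end) != end:
            return False
    return True


def lazy_random_sequence(basic_actions, steps, rng):
    """At each step, do nothing or append a uniformly random basic action."""
    sequence = []
    for _ in range(steps):
        if rng.random() < 0.5:
            sequence.append(rng.choice(basic_actions))
    return sequence


def length_bound(k, n, confidence):
    """The length 8 k n^5 ln(n) (n + ln(1/delta)) from Theorem 6.2, rounded up."""
    return math.ceil(8 * k * n**5 * math.log(n) * (n + math.log(1 / confidence)))


# Hypothetical permutation automaton: "x" shifts cyclically, "y" swaps states 0 and 1.
delta = {(q, "x"): (q + 1) % 3 for q in range(3)}
delta.update({(0, "y"): 1, (1, "y"): 0, (2, "y"): 2})
gamma = {0: 1, 1: 0, 2: 0}
empty_ok = is_homing(delta, gamma, range(3), ())
shift_ok = is_homing(delta, gamma, range(3), ("x",))
walk = lazy_random_sequence(("x", "y"), 50, random.Random(1))
bound = length_bound(2, 3, 0.1)
```

In this example the empty sequence is not a homing sequence, since states 1 and 2 give the same output, while the single action "x" already is one. The theoretical length bound is far larger than what is needed here.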

These theorems give our inference procedure a running time of $O(k^2 n^6 \log(n) (n + \log(1/\delta)))$.

5-7  A diversity-based algorithm for permutation automata

We  can  show  in  a  similar  manner  how  a  permutation  environment  can  be  inferred  using  a 
diversity-based  representation.  As  before,  we  reduce  the  problem  to  that  of  inferring  a  visible 
automaton  —  in  this  case,  one  for  which  all  of  the  test-equivalence  classes  are  known,  and  for 
which  the  value  of  each  test  class  is  observable  in  every  state.  The  problem  of  inferring  such 
automata  is  solved  in  Chapter  4  of  my  master's  thesis  [79];  the  solution  is  based  on  the  careful 
planning  of  experiments,  and  on  the  maintenance  of  candidate  sets  similar  to  those  described 
in  Section  5-5. 

Let $h$ be a given diversity-based homing sequence for the unknown permutation environment
$\mathcal{E}$. As before, to simulate the inference algorithm for visible automata, it suffices to show
that the state of the automaton (i.e., the values of the test classes) can be observed by executing
$h$, and further that it is possible to reach a state whose identity is known even without executing
$h$. Since $\mathcal{E}$ is a permutation environment, we can show that every test class is represented by
some prefix of $h$. Therefore, at the current state $q$, the values of all the test classes can be
observed simply by executing $h$.

If, having executed $h$ from some state $q$, we find that candidate set $C(h_i)$ is coherent, then
the value of test $h_i$ in the current state $qh$ is just the selected value of $C(h_i)$. (As before, $h_i$ is
the prefix of $h$ of length $i$.) Thus, if all the candidate sets are coherent, then $qh(h)$, the output
of the entire sequence, is known in the current state. On the other hand, if one of the candidate
sets is incoherent, then by re-executing $h$ we are guaranteed to reduce one of the candidate
sets. Thus, we can quickly reach a state in which the output of $h$ is known without actually
executing it.

We say action sequence $a$ is a diversity-based distinguishing sequence if every test is equivalent
to some prefix of $a$. Such a sequence is clearly a distinguishing sequence, since if $q_1 \ne q_2$
then there exists a test $t$ distinguishing the two states; since $t \equiv p$ for some prefix $p$ of $a$,
$\gamma(q_1 p) \ne \gamma(q_2 p)$ and so $q_1(a) \ne q_2(a)$.

A diversity-based distinguishing sequence is also a diversity-based homing sequence, as is
obvious from their definitions. In permutation environments (but not in general), the converse
holds: Suppose $h$ is a diversity-based homing sequence. Let $[t_1], [t_2], \ldots, [t_D]$ be the equivalence
classes of $\mathcal{E}$. Then there exist prefixes $p_1, p_2, \ldots, p_D$ of $h$ such that $p_i \equiv h t_i$. Since $\mathcal{E}$ is a
permutation environment, if $t_i \not\equiv t_j$ then $h t_i \not\equiv h t_j$. Therefore, the $D$ prefixes $p_i$ are pairwise
inequivalent, and so every equivalence class is represented by some prefix of $h$. Thus, $h$ is a
diversity-based distinguishing sequence.

As  in  the  last  section,  we  assume  a  diversity-based  homing  sequence  h  has  been  given,  and 
show  later  how  such  a  sequence  can  be  randomly  constructed. 

Our procedure is given in Figure 11. As mentioned above, the algorithm maintains various
kinds of candidate sets. First, for each $0 \le i \le |h|$, a set $G(i)$ is maintained with the interpretation
that $j$ is in $G(i)$ if $h_j$ could plausibly be equivalent to $h h_i$, i.e., if it has not yet been
determined that $h_j \not\equiv h h_i$. (Thus, $G(i) = C(h_i)$ in the notation of Section 5-5.) As described
above, such candidate sets are useful for reaching a state in which the output of $h$ is known
prior to its execution from that state.

The algorithm also maintains sets $U(i,b)$ for $0 \le i \le |h|$ and $b \in B$; these sets consist of
indices $j$ for which $h_j$ is plausibly equivalent to $b h_i$. To see why such sets might be useful,
suppose $h$ has been executed from some state $q$ with output $\sigma$. As seen before, if $G(i)$ is

Input:  access to $\mathcal{E}$, a permutation automaton
        $h$ - a diversity-based homing sequence
Output: a perfect model of $\mathcal{E}$

Procedure:

1   $G(i), U(i,b) \leftarrow \{0, \ldots, |h|\}$ for $i \in \{0, \ldots, |h|\}$, $b \in B$
2   execute $h$, producing output $\sigma$
3   repeat
4       if $G(i)$ is incoherent for some $0 \le i \le |h|$ then
5           execute $h$, producing output $\tau$
6           $G(i) \leftarrow G(i) \cap \sigma^{-1}(\tau_i)$ for $i \in \{0, \ldots, |h|\}$
7           $\sigma \leftarrow \tau$
8       else
9           $\beta_i \leftarrow \sigma[G(i)]$ for $i \in \{0, \ldots, |h|\}$
10          if PLAN-EXP can find a shortest useful experiment $ab$ then
11              $\alpha_i \leftarrow \beta[U(i,a)]$ for $i \in \{0, \ldots, |h|\}$
12              execute $ab$
13              execute $h$, producing output $\tau$
14              $U(i,b) \leftarrow U(i,b) \cap \alpha^{-1}(\tau_i)$ for $i \in \{0, \ldots, |h|\}$
15              $\sigma \leftarrow \tau$
16          else
17              exit loop
20  end
21  let $q$ be the current state
22  $\beta_i \leftarrow \sigma[G(i)]$ for $i \in \{0, \ldots, |h|\}$
23  output the following prediction rule (model of $\mathcal{E}$):
24      on input $a \in A$, predict $\gamma(qa) = \beta[U(0,a)]$


Figure 11: A diversity-based algorithm for inferring permutation environment $\mathcal{E}$.

coherent for all i, then β = qh⟨h⟩ is known, the output of h if executed from the current state
qh. If, moreover, U(i,b) is coherent (with respect to β), then γ(qhbh_i) is known; thus, if this
is the case for all i, then qhb⟨h⟩ can be determined, the output of h from the state reached if b
were executed.

The function U can be extended in a natural manner to the domain {0,...,|h|} × A by the
rule U(i,λ) = {i} and U(i,ab) = ∪_{j ∈ U(i,b)} U(j,a) for i ∈ {0,...,|h|}, a ∈ A and b ∈ B. Then
the above statements also hold if b is replaced by any action a ∈ A.
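
As a minimal sketch, the extension of U to action sequences can be computed recursively. The
code below assumes the rule U(i,ab) = ∪_{j ∈ U(i,b)} U(j,a), which is the form suggested by the
definition of a useful experiment; the dictionary layout and the name extend_U are illustrative
choices, not part of the original procedure.

```python
def extend_U(U, i, seq):
    """Return the candidate indices for the test "seq followed by h_i".

    U maps (index, basic_action) to a set of indices; seq is a tuple of basic
    actions.  Assumed rule: U(i, lambda) = {i}, and U(i, ab) is the union of
    U(j, a) over all j in U(i, b), where b is the last action of the sequence.
    """
    if not seq:
        return {i}
    a, b = seq[:-1], seq[-1]
    result = set()
    for j in U[(i, b)]:
        result |= extend_U(U, j, a)
    return result


# Hypothetical candidate sets over indices 0..2 with a single basic action "x".
U = {(0, "x"): {1}, (1, "x"): {2}, (2, "x"): {0, 1}}
one_step = extend_U(U, 0, ("x",))
two_steps = extend_U(U, 1, ("x", "x"))
```

In this made-up instance U(1,xx) contains two indices, so whether it is coherent depends on
whether the two corresponding outputs agree.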

Our algorithm works by trying to reduce the candidate sets U(i,b) as much as possible until
U(i,a) is coherent for all i and all a ∈ A; at this point, from the preceding comments, a perfect
model has been attained.

Let σ, β and q be as above, assuming all G(i)'s are coherent with respect to σ. If U(i,b) is
incoherent (with respect to β), then executing b and then h will clearly cause some candidate
set U(i,b) to shrink. In this case, b is called an immediately useful experiment. However, it may
be the case that there is no immediately useful experiment (all the sets U(i,b) are coherent)
but, nevertheless, some set U(i,a) is incoherent for a ∈ A, so that a perfect model has not been
achieved. In this case, it is possible to find a useful experiment; this is an experiment in which
a "set-up" action a ∈ A is first executed leading to a state in which an immediately useful
experiment can be executed.

More precisely, a sequence ab, where a ∈ A and b ∈ B, is a useful experiment if, for some
0 ≤ i ≤ |h|, U(i,ab) is incoherent, but U(j,a) is coherent for j ∈ U(i,b). Note that the shortest
useful experiment has the additional property that U(j,a) is coherent for all j, 0 ≤ j ≤ |h|
(otherwise, a prefix of a would be a shorter useful experiment). A procedure for finding a
shortest useful experiment, called PLAN-EXP, was described in my master's thesis [79], and
is treated here as a "black-box" subroutine. (The inputs required by PLAN-EXP are omitted
from Figure 11, but are described fully below.)

Thus, at a high level, our algorithm is simple: execute h; if some G(i) is incoherent, then
re-execute h and update G; otherwise, find and execute a shortest useful experiment, and update
U. If no useful experiment exists, then a perfect model has been found.
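
The high-level loop can be sketched on a simulated environment as follows. This is an
illustration under stated assumptions, not the original implementation: the environment is
given by hypothetical functions step and gamma, h is a tuple of basic actions whose prefixes are
assumed to represent every equivalence class of tests, the extension of U to sequences is
assumed to be U(i,ab) = ∪_{j ∈ U(i,b)} U(j,a), and PLAN-EXP is replaced by a brute-force search
that takes exponential time.

```python
from itertools import product


def infer(step, gamma, state, h, actions):
    """Sketch of the loop in Figure 11; returns (predict, final_state)."""
    n = len(h) + 1  # one index per prefix h_0, ..., h_|h|

    def run_h():
        # Execute h from the current state and record the output of every prefix.
        nonlocal state
        outputs = [gamma(state)]
        for b in h:
            state = step(state, b)
            outputs.append(gamma(state))
        return outputs

    def coherent(cand, values):
        return len({values[j] for j in cand}) <= 1

    def ext(i, seq):  # assumed extension of U to action sequences
        if not seq:
            return {i}
        return set().union(*(ext(j, seq[:-1]) for j in U[(i, seq[-1])]))

    def plan_exp(beta):  # brute-force stand-in for PLAN-EXP, shortest first
        for length in range(n):
            for a in product(actions, repeat=length):
                for b, i in product(actions, range(n)):
                    setup_ok = all(coherent(ext(j, a), beta) for j in U[(i, b)])
                    if setup_ok and not coherent(ext(i, a + (b,)), beta):
                        return a, b
        return None

    G = [set(range(n)) for _ in range(n)]
    U = {(i, b): set(range(n)) for i in range(n) for b in actions}
    sigma = run_h()
    while True:
        if not all(coherent(G[i], sigma) for i in range(n)):
            tau = run_h()
            G = [{j for j in G[i] if sigma[j] == tau[i]} for i in range(n)]
            sigma = tau
            continue
        beta = [sigma[min(G[i])] for i in range(n)]
        experiment = plan_exp(beta)
        if experiment is None:
            break
        a, b = experiment
        alpha = [beta[min(ext(i, a))] for i in range(n)]
        for action in a + (b,):
            state = step(state, action)
        tau = run_h()
        for i in range(n):
            U[(i, b)] = {j for j in U[(i, b)] if alpha[j] == tau[i]}
        sigma = tau

    def predict(a):  # predicted output after executing a from final_state
        return beta[min(ext(0, tuple(a)))]

    return predict, state


# Toy permutation environment (hypothetical): three states on a cycle, and
# the output is 1 only in state 0.  The prefixes of h = xx cover all 3 classes.
cycle_step = lambda s, b: (s + 1) % 3
cycle_gamma = lambda s: 1 if s == 0 else 0
predict, q = infer(cycle_step, cycle_gamma, 0, ("x", "x"), ("x",))
```

On this toy cycle the returned rule predicts the output of every action sequence executed from
the final state q, which is what a perfect model requires.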

Theorem 7.1 The algorithm described in Figure 11 halts and outputs a perfect model of ℰ
after executing at most O(kD(|h| + D)) actions, and in time O(kD(|h| + D² + kD · α(kD, D))).

Proof: First, note that because of the manner in which G is updated, an index j is removed
from G(i) only if h_j ≢ hh_i. Thus, since h is a diversity-based homing sequence, G(i) is never
empty. Also, if j is removed from G(i), then every other index j' for which h_j ≡ h_j' must also
be removed since equivalent tests have the same value in every state.

In addition, every index j appears in some set G(i), i.e., ∪_i G(i) = {0,...,|h|}. To see that
this is so, note that, because h is a diversity-based distinguishing sequence, every equivalence
class is represented by the prefixes of h, that is, |{[h_i] : 0 ≤ i ≤ |h|}| = D. Since ℰ is a
permutation environment, h_i ≡ h_j if and only if hh_i ≡ hh_j. Thus, |{[hh_i] : 0 ≤ i ≤ |h|}| = D.
Therefore, the test h_j is equivalent to some hh_i, implying j ∈ G(i).

For the analysis, it is important to note that the set {G(i) : 0 ≤ i ≤ |h|} is a partition of
{0,...,|h|}. This can be proved by an inductive argument: For suppose, prior to the execution
of line 6, that G(i) and G(j) are equal or disjoint, for some i, j. Then if G(i) = G(j) and
τ_i = τ_j, then clearly G(i) ∩ σ⁻¹(τ_i) = G(j) ∩ σ⁻¹(τ_j). On the other hand, if G(i) and G(j) are
disjoint or if τ_i ≠ τ_j, then G(i) ∩ σ⁻¹(τ_i) and G(j) ∩ σ⁻¹(τ_j) must also be disjoint. In either
case, the new sets following the execution of line 6 will be equal or disjoint.
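
The refinement step of line 6 and the equal-or-disjoint invariant can be illustrated with a
small sketch. The function names and the output vectors below are made up for the example;
only the update G(i) ← G(i) ∩ σ⁻¹(τ_i) is taken from Figure 11.

```python
def is_coherent(cand, sigma):
    """A candidate set is coherent w.r.t. sigma if sigma agrees on all its indices."""
    return len({sigma[j] for j in cand}) <= 1


def refine(G, sigma, tau):
    """Line 6: keep in G[i] only the indices j whose earlier output sigma[j] equals tau[i]."""
    return [frozenset(j for j in G[i] if sigma[j] == tau[i]) for i in range(len(G))]


def equal_or_disjoint(sets):
    return all(a == b or not (a & b) for a in sets for b in sets)


# Hypothetical run with |h| = 3 (indices 0..3) and made-up binary outputs.
G = [frozenset(range(4))] * 4
sigma = [0, 1, 0, 1]  # outputs seen on the previous execution of h
tau = [1, 0, 1, 0]    # outputs seen on the latest execution of h
G = refine(G, sigma, tau)
```

After the update the four sets collapse to two distinct blocks, {1, 3} and {0, 2}, which are
disjoint and together cover all indices.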

Thus, since the set {0 ≤ i ≤ |h| : h_i ≡ x} respects G(i) for any test x on each iteration,
it follows that |{G(i) : 0 ≤ i ≤ |h|}| ≤ D on each iteration. Since some G(i) shrinks each time
that lines 5-7 are executed, it follows that this block is executed at most D − 1 times.

We would like to give a similar argument showing that lines 9-17 are executed at most
k(D − 1) times. We will first give an inductive proof that j ∉ U(i,b) only if h_j ≢ bh_i:

Suppose that h has been executed from q with output σ. Suppose also that each G(i) is
coherent with respect to σ. Then there is some j for which h_j ≡ hh_i, and which is therefore in
G(i). Thus γ(qhh_i) = γ(qh_j) = σ_j = σ[G(i)], and so qh⟨h⟩ = β where β_i = σ[G(i)], as in the
figure.

Suppose that PLAN-EXP returns an experiment ab. Then U(i,a) is coherent (with respect
to β) for all 0 ≤ i ≤ |h|, but, for some i, U(i,ab) is not. Since h is a diversity-based distinguishing
sequence, there exists j for which h_j ≡ ah_i. By inductive hypothesis, j ∈ U(i,a). Since U(i,a)
is coherent, we have α_i = β[U(i,a)] = γ(qhh_j) = γ(qhah_i). Thus, α = qha⟨h⟩.

It can now be verified that j is removed from U(i,b) only if h_j ≢ bh_i, completing the
induction. As before, this implies that each U(i,b) is nonempty on each iteration, and that
∪_i U(i,b) = {0,...,|h|} on each iteration for each b ∈ B. Also, having argued that α = qha⟨h⟩
at this point in the program, it can now be argued as before that the set {i : h_i ≡ x} respects
each U(i,b), and that the set {U(i,b) : 0 ≤ i ≤ |h|} is a partition consisting of at most D blocks
for each b ∈ B.

Since ab is a useful experiment, some set U(i,b) must shrink at line 14. Thus, by the
preceding arguments, lines 9-17 are executed at most k(D − 1) times.

We will later argue that the returned useful experiment has length at most D. This will
then complete the proof of the action execution bound.

Given the above arguments, it is quite easy to prove the correctness of the output prediction
rule: On exiting the main loop, each set G(i) or U(i,a) is coherent (with respect to σ and β,
respectively, as in the figure) for all i and a ∈ A. As argued above, in the current state q, this
implies that q⟨h⟩ = β. Also, given a ∈ A, we showed above that γ(qah_i) = β[U(i,a)]. Thus,
γ(qa) = β[U(0,a)], and the output rule is a perfect model.

Finally,  we  turn  to  efficiency  considerations.  If  naively  implemented,  the  running  time  of 
the  procedure  may  be  quite  poor.  However,  using  similar  techniques  to  those  described  in 
Section  5-5,  we  can  derive  a  time  bound  comparable  to  the  action  execution  bound. 

In particular, we maintain a partition π over the set {0,...,|h|} with the condition that i
and j belong to the same block of π if and only if the values of h_i and h_j have never differed on
any execution of h (so that the two tests are plausibly equivalent). As before, if h_i ≡ h_j, then
i and j must be in the same block of π. Thus, |π| ≤ D.

It is easily verified that, on each iteration, if i and i' are in the same block of π, then
G(i) = G(i'), and {i, i'} respects each set G(j). Similarly, for b ∈ B, U(i,b) = U(i',b) and
{i, i'} respects each set U(j,b). Thus, with respect to the data structures G and U, the two
indices i and i' are entirely indistinguishable. Therefore, we can represent these structures more
efficiently in terms of the blocks of π.

In particular, as was done in Section 5-5, we can represent each candidate set as a list of
pointers to those blocks of π which it includes. Thus, the representation of such a set has size
at most D. Also, since G(i) = G(j) if i and j are in the same block, we only need maintain
a candidate set for a single member of each block (say, the minimum element). That is, we
maintain a candidate set G(i) or U(i,b) (explicitly represented as described above) if and only
if i is the smallest member of its block; the other candidate sets are only implicitly maintained,
based on the equalities among candidate sets described above.

With such a representation, lines 4, 6 and 14 take only time O(D²). Using the fact (to
be proved) that |ab| ≤ D, we can also show that line 11 takes time O(D² + |h|): computing
α_i = β[U(i,a)] for a single value of i takes O(D) time since |a| is bounded, and since U(i,a)
is known to be coherent. Thus, computing α_i for each i ∈ {min(s) : s ∈ π} takes O(D²) time.
Finally, all the other values of α_i can be computed by setting α_i ← α_min(s) for s ∈ π, i ∈ s in
O(|h|) time.

The partition π is easily maintained in the same manner described in Section 5-5: Each time
that h is executed, the coherence of each block of π is checked in O(|h|) time. If any block is
incoherent, then the structures G and U must be updated; this takes O(kD²) time. Since π can
be partitioned at most D times, this adds O(kD³) to the total running time of the procedure.
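
A minimal sketch of the partition maintenance follows: after each execution of h, every block
is split so that two indices stay together only if their outputs have agreed so far. The list-of-
lists representation and the name refine_partition are assumptions made for illustration; the
thesis stores candidate sets as pointers to blocks, which is not modeled here.

```python
def refine_partition(blocks, tau):
    """Split each block of the partition according to the outputs tau of the latest run of h."""
    new_blocks = []
    for block in blocks:
        by_value = {}
        for i in sorted(block):
            by_value.setdefault(tau[i], []).append(i)
        new_blocks.extend(by_value.values())
    return new_blocks


# Hypothetical example with indices 0..3 and two executions of h.
blocks = [[0, 1, 2, 3]]
blocks = refine_partition(blocks, [0, 1, 0, 1])
after_first = [list(b) for b in blocks]
blocks = refine_partition(blocks, [0, 1, 1, 1])
representatives = [min(b) for b in blocks]  # one explicit candidate set per block
```

Each call scans every index once, which matches the O(|h|) coherence check described above;
the cost of updating G and U after a split is not part of this sketch.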

It remains then only to show how the running time of PLAN-EXP can be bounded. The
procedure PLAN-EXP takes as input a set V of variables; a set of candidate sets for each v ∈ V,
b ∈ B; and an assignment to the variables in V. It returns a shortest useful experiment ab
(or reports that none exists) in time O(k|V| · α(k|V|, |V|)) where α is a functional inverse of
Ackermann's function [82]. The length of the returned experiment ab is bounded by |V|.

Thus, if we use {0,...,|h|} as our variable set in our call to PLAN-EXP, then the procedure
may take too long, and could plausibly return an experiment far longer than D. Instead, we
will use the blocks of π as our variable set. The candidate sets are then defined naturally by
the rule U'(s,b) = {s' ∈ π : s' ⊆ U(min(s),b)} for s ∈ π and b ∈ B. The assignment β'(s) is
similarly defined to be β(min(s)).

Note that our representation scheme for U is essentially equivalent to the structure U', and
the structure β' is easily computed in O(D) time. Also, since |π| ≤ D, PLAN-EXP runs in
time O(kD · α(kD, D)), and returns an experiment of length at most D.

It can be argued by induction on the length of a that U'(s,a) = {s' ∈ π : s' ⊆ U(i,a)} for
s ∈ π and a ∈ A, assuming i ∈ s. With this fact, it can be seen that U'(s,a) is coherent with
respect to β' if and only if U(i,a) is coherent with respect to β.

In particular, this shows that if PLAN-EXP when called in this manner returns an experiment
ab, then U(i,a) is coherent (with respect to β) for all 0 ≤ i ≤ |h|, but, for some i, U(i,ab)
is not; that is, ab is indeed a shortest useful experiment. Likewise, if PLAN-EXP fails to find
a useful experiment, then each U(i,a) is coherent for all i and all a ∈ A.

This  completes  the  proof.  ■ 

As in the state-based case, we can construct a diversity-based homing sequence by choosing
a sufficiently long sequence of actions. Below, H_n = Σ_{i=1}^{n} (1/i) is the nth harmonic number. It
is well known that H_n = O(log n).

Theorem 7.2 Let δ > 0, and let h be a random sequence of length 2kD³H_D · ln(D) · ln(D/δ).
Then h is a diversity-based homing sequence with probability at least 1 − δ.

Proof: We follow the algorithm of Figure 4 for constructing a diversity-based homing sequence.
On each iteration, we need to find an extension x to h for which hx is inequivalent to every prefix
of h. That is, if v equivalence classes are represented by the prefixes of h, and [t_1], [t_2], ..., [t_{D−v}]
are the equivalence classes not represented, then we wish to find x such that hx ≡ t_i for some
i. Equivalently, we want x ≡ h⁻¹t_i. (Here, h⁻¹ is a sequence of actions for which h⁻¹h is the
"identity" action, i.e., qh⁻¹h = q for all q ∈ Q. The existence of h⁻¹ is guaranteed by the fact
that ℰ is a permutation environment.)

Based  on  the  results  on  random  walks  given  in  my  master’s  thesis  [79],  it  is  easy  to  conclude 
the  following: 

Lemma 7.3 Let t be any test, and let x be a random sequence of length kD² ln(D) of the
following form: At each step, with equal probability, we either do nothing, or we execute a
uniformly and randomly chosen basic action from B. Then the probability that t ≡ x is at
least 1/(2D).

Thus, the probability that an extension x as described above will be equivalent to any
h⁻¹t_i is at least (D − v)/(2D). Extending h in this manner (2D/(D − v)) · ln(1/δ) times gives
a probability of at least 1 − δ of successfully increasing the number of equivalence classes
represented by the prefixes of h. Replacing δ with δ/D, we can conclude that h is a homing
sequence with probability at least 1 − δ if its length is at least

    Σ_{v=0}^{D−1} kD² ln(D) · (2D/(D − v)) · ln(D/δ)

as  claimed.  (This  sequence  may  be  longer  than  strictly  necessary  since  v  may  increase  by  more 
than  one  with  each  extension;  also,  many  of  the  “actions”  required  by  the  lemma  are  actually 
“no-ops.”  This,  however,  does  not  affect  the  argument  since  a  homing  sequence  remains  one 
even  if  suffixed  or  prefixed.)  ■ 
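
As a quick numerical check of the arithmetic, the sketch below evaluates the length in
Theorem 7.2, read here as 2kD³H_D · ln(D) · ln(D/δ), and compares it with the sum over
v = 0, ..., D − 1 from the proof. It also generates one extension of the form used in Lemma 7.3.
The function names, the use of None for a do-nothing step, and the rounding up of the length
are choices made for this example only.

```python
import math
import random


def harmonic(n):
    """The nth harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def homing_length(k, D, delta):
    """Closed form 2 k D^3 H_D ln(D) ln(D/delta)."""
    return 2 * k * D**3 * harmonic(D) * math.log(D) * math.log(D / delta)


def homing_length_as_sum(k, D, delta):
    """The same quantity written as the sum over v = 0, ..., D-1 from the proof."""
    per_extension = k * D**2 * math.log(D)  # length of one extension in Lemma 7.3
    return sum(per_extension * (2 * D / (D - v)) * math.log(D / delta) for v in range(D))


def lazy_random_extension(basic_actions, k, D, rng):
    """One extension: each step is a no-op (None) or a uniformly random basic action."""
    length = math.ceil(k * D**2 * math.log(D))
    return [rng.choice(basic_actions) if rng.random() < 0.5 else None for _ in range(length)]


extension = lazy_random_extension(["a", "b"], 2, 5, random.Random(0))
```

The two length formulas agree because the terms 1/(D − v) for v = 0, ..., D − 1 sum to H_D.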




Thus, our inference procedure runs in time O(k²D⁴ log²(D) · log(D/δ)). This improves the
previously best-known bound of O(k²D⁷ log(D) · log(kD/δ)) given by Rivest and Schapire [73,
79] by roughly a factor of D³/log(D).
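
The running time can be traced back to Theorems 7.1 and 7.2 by direct substitution. The
derivation below is a sketch of that arithmetic, reading the two bounds as
O(k²D⁴ log²(D) · log(D/δ)) and O(k²D⁷ log(D) · log(kD/δ)); it uses only H_D = O(log D) and the
observation that the terms kD³ and k²D² · α(kD, D) are dominated by kD · |h|.

```latex
\begin{align*}
|h| &= 2kD^3 H_D \ln(D)\,\ln(D/\delta)
     = O\bigl(kD^3 \log^2(D)\,\log(D/\delta)\bigr), \\
kD\bigl(|h| + D^2 + kD\,\alpha(kD,D)\bigr)
    &= O\bigl(k^2 D^4 \log^2(D)\,\log(D/\delta)\bigr), \\
\frac{k^2 D^7 \log(D)\,\log(kD/\delta)}{k^2 D^4 \log^2(D)\,\log(D/\delta)}
    &= \frac{D^3}{\log(D)} \cdot \frac{\log(kD/\delta)}{\log(D/\delta)} .
\end{align*}
```

The last ratio is D³/log(D) up to the factor log(kD/δ)/log(D/δ), which is why the improvement
is stated as "roughly" a factor of D³/log(D).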

5-8  Experimental  results 

The  algorithm  described  in  Section  5-4  has  been  implemented  and  tested  on  several  simple 
robot  environments. 

In  the  “Random  Graph”  environment,  the  robot  is  placed  on  a  randomly  generated  directed 
graph.  The  graph  has  n  vertices,  and  each  vertex  has  one  out-going  edge  labeled  with  each  of  the 
k basic actions. For each vertex i, one edge (chosen at random) is directed to vertex i + 1 mod n;
this  ensures  that  the  graph  contains  a  Hamiltonian  cycle,  and  so  is  strongly  connected.  The 
other  edges  point  to  randomly  chosen  vertices,  and  the  output  of  each  vertex  is  also  chosen  at 
random. 
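
A generator for such an environment might look like the following sketch. The table layout,
the function name, and the choice of binary outputs are assumptions made for illustration; the
text only says that the outputs are chosen at random.

```python
import random


def random_graph_env(n, k, rng):
    """Return (delta, output): delta[v][b] is the vertex reached from v by action b."""
    delta = [[rng.randrange(n) for _ in range(k)] for _ in range(n)]
    for v in range(n):
        # One randomly chosen edge of each vertex closes the Hamiltonian cycle.
        delta[v][rng.randrange(k)] = (v + 1) % n
    output = [rng.randrange(2) for _ in range(n)]  # binary outputs (assumed)
    return delta, output


def reachable(delta, start):
    """Set of vertices reachable from start, by depth-first search."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in delta[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


delta, output = random_graph_env(10, 3, random.Random(1))
```

Because every vertex v has an edge to v + 1 mod n, every vertex is reachable from every other,
whatever the remaining random edges are.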

In  the  “Knight  Moves”  environment,  the  robot  is  placed  on  a  square  checker-board,  and 
can  make  any  of  the  legal  moves  of  a  chess  knight.  However,  if  the  robot  attempts  to  move  off 
the  board,  its  action  fails  and  no  movement  occurs.  The  robot  can  only  sense  the  color  of  the 
square  it  occupies.  Thus,  when  away  from  the  walls,  every  action  simply  inverts  the  robot’s 
current  sensation:  any  move  from  a  white  square  takes  the  robot  to  a  black  square,  and  vice 
versa.  This  makes  it  difficult  for  the  robot  to  orient  itself  in  this  environment. 
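
The dynamics of this environment are simple to simulate. In the sketch below, the encoding of
the eight knight moves as integer actions and the coloring rule (row + column) mod 2 are
assumptions chosen for the example.

```python
KNIGHT_MOVES = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]


def knight_step(pos, action, size):
    """Apply one knight move on a size x size board; off-board moves fail."""
    r, c = pos
    dr, dc = KNIGHT_MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < size and 0 <= nc < size:
        return (nr, nc)
    return pos  # the action fails and no movement occurs


def color(pos):
    """The robot's only sensation: the color of its square, as 0 or 1."""
    return (pos[0] + pos[1]) % 2


interior = knight_step((3, 3), 0, 8)
blocked = knight_step((0, 0), 4, 8)
```

A successful move always changes row + column by an odd amount, so the sensed color flips;
only a failed move near a wall leaves the sensation unchanged.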

Finally,  in  the  “Crossword  Puzzle”  environment,  the  robot  is  on  a  crossword  puzzle  grid 
such  as  the  one  in  Figure  12.  The  robot  has  three  actions  available  to  it:  it  can  step  ahead  one 
square,  or  it  can  turn  left  or  right  by  90  degrees.  The  robot  can  only  occupy  the  white  squares 
of  the  crossword  puzzle:  an  attempt  to  move  onto  a  black  square  is  a  “no-op.”  Attempting  to 
step  beyond  the  boundaries  of  the  puzzle  is  also  a  no-op.  Each  of  the  four  “walls”  of  the  puzzle 
has  been  painted  a  different  color.  The  robot  looks  as  far  ahead  as  possible  in  the  direction  it 
faces:  if  its  view  is  obstructed  by  a  black  square,  then  it  sees  “black;”  otherwise,  it  sees  the  color 
of  the  wall  it  is  facing.  Thus,  the  robot  has  five  possible  sensations.  Since  this  environment  is 
essentially  a  maze,  it  may  contain  regions  which  are  difficult  to  reach  or  difficult  to  get  out  of. 
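
A simulation of this environment can be sketched as follows. The grid encoding ("." for white,
"#" for black), the wall color names, and the small example grid are all hypothetical; Figure 12
itself is not reproduced here.

```python
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]      # north, east, south, west
WALL_COLORS = ["red", "green", "blue", "yellow"]   # hypothetical, one color per wall


def act(grid, state, action):
    """state is (row, col, heading); action is "step", "left" or "right"."""
    r, c, d = state
    if action == "left":
        return (r, c, (d - 1) % 4)
    if action == "right":
        return (r, c, (d + 1) % 4)
    nr, nc = r + HEADINGS[d][0], c + HEADINGS[d][1]
    inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
    if inside and grid[nr][nc] == ".":
        return (nr, nc, d)
    return state  # black square or boundary: the step is a no-op


def sense(grid, state):
    """Look ahead along the heading: "black" if a black square blocks the view."""
    r, c, d = state
    dr, dc = HEADINGS[d]
    r, c = r + dr, c + dc
    while 0 <= r < len(grid) and 0 <= c < len(grid[0]):
        if grid[r][c] == "#":
            return "black"
        r, c = r + dr, c + dc
    return WALL_COLORS[d]


GRID = ["..#",
        "...",
        "#.."]
```

With four wall colors plus "black", the sketch yields the five possible sensations mentioned
above.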

In the current implementation, we have used an adaptive homing sequence or homing tree.
We have also used the modified version of L* described in Section 5-4.5. Finally, we have
implemented a heuristic that attempts to focus effort on copies of L* that have already made
the most progress: if the homing sequence is executed and the L* copy reached is not very far
along, then the procedure is likely to re-execute the homing sequence to find one that is closer
to completion. The idea of the heuristic is not to waste time on copies that have a long way to
go. The heuristic seems to improve the running time for these three environments by as much
as a factor of six.

Figure 12: A crossword puzzle environment.

For the "Random Graph" and "Crossword Puzzle" environments, the inference procedure
was provided in some experiments with an oracle which would return the shortest counterexample
to an incorrect conjecture. All three environments were also tested with no external source
of counterexamples: to find a counterexample, the robot would instead execute random actions
until its model of the environment made an incorrect prediction of the output of some state.
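
The random-walk search for a counterexample can be sketched as below. The interface (the
callables step, sense and predict, and the cap max_steps) is an assumption made for this
example and does not describe the original C implementation.

```python
import random


def find_counterexample(step, sense, predict, state, actions, rng, max_steps):
    """Walk randomly from state; return the walk at the first wrong prediction, else None."""
    walk = []
    for _ in range(max_steps):
        action = rng.choice(actions)
        walk.append(action)
        state = step(state, action)
        if predict(walk) != sense(state):
            return walk
    return None


# Hypothetical example: the true environment is a 3-cycle whose output is 1 in
# state 0, while the (incorrect) model believes the outputs repeat with period 2.
true_step = lambda s, a: (s + 1) % 3
true_sense = lambda s: 1 if s == 0 else 0
wrong_model = lambda walk: 1 if len(walk) % 2 == 0 else 0

walk = find_counterexample(true_step, true_sense, wrong_model, 0, ["x"],
                           random.Random(0), max_steps=100)
```

Unlike the shortest-counterexample oracle, this search gives no guarantee on the length of the
sequence it returns.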

Table 1 summarizes how our procedure handled each environment. In the table, "Source"
refers to the robot's source of counterexamples: "S" indicates that the robot had access to the
shortest counterexample, and "R" indicates that it had to rely on random walks. The column
labeled "|ran(γ)|" gives the number of possible sensations which might be experienced by the
robot. (Extending our algorithms to the case that the range of γ consists of more than two
elements is trivial.) "Copies" is the number of copies of L* which were active when a correct
conjecture was made. "Queries" is the total number of membership and equivalence queries
which were simulated. "Actions" is the total number of actions executed by the robot, and
"Time" is elapsed cpu time in minutes and seconds. The procedure was implemented in C
on a DEC MicroVax III. For example, inferring the 8 × 8 "Knight Moves" environment using
randomly generated counterexamples required about 400,000 moves and 19 seconds of cpu time.

Table 1: Experimental results.

Note that for the "Random Graph" environment, the learning procedure sometimes did
better with randomly generated counterexamples than with an oracle providing the shortest
counterexample. It is not clear why this is so, although it seems plausible that in some way
the random walk sequences give more information about the environment. For example, the
counterexamples often become subsequences of the homing sequence, and it may be that random
walk counterexamples make for better, more distinguishing homing sequences.

In  sum,  the  running  times  given  are  quite  fast,  and  the  number  of  moves  taken  far  less 
than  allowed  for  by  the  theoretical  worst-case  bounds.  Nevertheless,  it  is  also  true  that  the 
number  of  actions  executed  is  still  somewhat  large,  much  too  great  to  be  practical  for  a  real 
robot.  There  are  probably  many  ways  in  which  our  algorithm  might  be  improved  —  both  in  a 
theoretical  sense,  and  in  terms  of  heuristics  which  might  improve  the  performance  in  practice. 
We  leave  these  questions  as  open  problems. 


5-9  Conclusions  and  open  questions 

We have shown how to infer an unknown automaton, in the absence of a reset, by
experimentation and with counterexamples. For the class of permutation automata, we have shown
that the source of counterexamples is unnecessary. We have described polynomial-time algorithms
which are both state-based and diversity-based.

As discussed in the introduction, these results represent only modest progress toward our
ultimate goal, the development of a robot capable of inferring a usable model of its real-world
environment.  It  is  not  clear  how  to  get  there  from  where  we  are  now.  To  begin  with,  we  need 
algorithms  that  are  even  more  efficient  than  the  ones  described  here.  Perhaps  more  importantly, 
we  need  techniques  for  handling  more  realistic  environments.  These  would  include  environments 
exhibiting  various  kinds  of  randomness  or  uncertainty,  and  also  environments  with  infinitely 
many  states.  In  such  cases,  inference  of  a  perfect  model  will  almost  certainly  be  out  of  the 
question.  What  then  is  the  best  we  can  hope  for?  What  are  the  skills  most  needed  for  the 
robot  to  function  in  its  environment,  and  how  can  those  skills  be  learned? 